Unit-1

Contents

1. Introduction
   Why data mining?
   What is data mining?
2. Integration of a Data Mining System with a Data Warehouse
3. Data Mining Primitives
4. Data & Data Types
   4.1 Relational Databases
   4.2 Data Warehouses
   4.3 Transactional Databases
   4.4 Advanced Databases – Object-Relational Databases
   4.5 Application Databases
5. Data mining functionalities
   5.1 Class/Concept Description: Characterization and Discrimination
   5.2 Mining Frequent Patterns, Associations, and Correlations
       Frequent patterns
       Association analysis
   5.3 Classification and Regression (for Predictive Analysis)
   5.4 Cluster Analysis
   5.5 Outlier Analysis
6. Interestingness of Patterns
7. A Classification of Data Mining Systems
   7.1 Classification according to the kinds of databases mined
   7.2 Classification according to the kinds of knowledge mined
   7.3 Classification based on level of abstraction
   7.4 Classification according to the kinds of techniques utilized
   7.5 Classification based on data regularities
   7.6 Classification according to the application adapted
8. Major Issues in Data Mining
   8.1 Mining Methodology
   8.2 User Interaction
   8.3 Efficiency and Scalability
   8.4 Diversity of Data Types
   8.5 Data Mining and Society
9. Data Preprocessing
   9.1 Data Cleaning
       9.1.1 Missing values
       9.1.2 Noisy data
   9.2 Data Integration
       9.2.1 Entity Identification Problem
       9.2.2 Redundancy and Correlation Analysis
       9.2.3 Tuple Duplication
       9.2.4 Data Value Conflict Detection and Resolution
   9.3 Data Reduction
       9.3.1 Wavelet transforms
       9.3.2 Principal components analysis
       9.3.3 Attribute Subset Selection
       9.3.4 Regression and Log-Linear Models: Parametric Data Reduction
       9.3.5 Histograms
       9.3.6 Clustering
       9.3.7 Sampling
       9.3.8 Data Cube Aggregation
   9.4 Data Transformation and Data Discretization
       9.4.1 Normalization
       9.4.2 Discretization
UNIT I DATA MINING
1. Introduction
Why data mining?
– Data collection and data availability have improved, e.g., automated data collection tools, database systems, the Web, and an increasingly computerized society.
– Major sources of abundant data have evolved.
– Data mining, the automated analysis of massive data sets, has become necessary to survive business competition.

What is data mining?
Data mining (knowledge discovery from data) is the extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data. Alternative names include knowledge discovery in databases (KDD), knowledge extraction, and data/pattern analysis.

A view often adopted in statistics and machine learning (and in health care and medical data mining) sees the data mining process as:
– Preprocessing of the data (including feature extraction and dimension reduction)
– Classification and/or clustering
– Post-processing for presentation of the results
[Figure: the knowledge discovery process: data from databases is cleaned and integrated, task-relevant data is selected, and data mining is applied to extract patterns.]
2. Integration of a Data Mining System with a Data Warehouse
A data mining system may or may not be integrated with a database or data warehouse system. The possible integration schemes are as follows:
– No Coupling
– Loose Coupling
– Semi−tight Coupling
– Tight coupling
No Coupling − In this scheme, the data mining system does not utilize any of the database or data
warehouse functions. It fetches the data from a particular source and processes that data using some
data mining algorithms. The data mining result is stored in another file.
Loose Coupling − In this scheme, the data mining system may use some of the functions of a database or data warehouse system. It fetches the data from the data repository managed by these systems and performs data mining on that data. It then stores the mining result either in a file or in a designated place in a database or data warehouse.
Semi−tight Coupling − In this scheme, the data mining system is linked with a database or a data
warehouse system and in addition to that, efficient implementations of a few data mining primitives
can be provided in the database.
Tight coupling − In this coupling scheme, the data mining system is smoothly integrated into the
database or data warehouse system. The data mining subsystem is treated as one functional component
of an information system.
3. Data Mining Primitives
Each user will have a data mining task in mind, that is, some form of data analysis that he or she would
like to have performed. A data mining query is defined in terms of data mining task primitives. These
primitives allow the user to interactively communicate with the data mining system during discovery in
order to direct the mining process, or examine the findings from different angles or depths.
1. The set of task-relevant data to be mined:
This specifies the portions of the database or the set of data in which the user is interested. This includes
the database attributes or data warehouse dimensions of interest (referred to as the relevant attributes or
dimensions).
2. The kind of knowledge to be mined:
This specifies the data mining functions to be performed, such as characterization, discrimination,
association or correlation analysis, classification, prediction, clustering, outlier analysis, or evolution analysis.
3. The background knowledge to be used in the discovery process:
This knowledge about the domain to be mined is useful for guiding the knowledge discovery process and for evaluating the patterns found. Concept hierarchies are a popular form of background knowledge, which allow data to be mined at multiple levels of abstraction. User beliefs regarding relationships in the data are another form of background knowledge.
4. The interestingness measures and thresholds for pattern evaluation:
Different kinds of knowledge may have different interestingness measures. For example, interestingness
measures for association rules include support and confidence. Rules whose support and confidence values are
below user-specified thresholds are considered uninteresting.
5. The expected representation for visualizing the discovered patterns:
This refers to the form in which discovered patterns are to be displayed, which may include rules, tables,
charts, graphs, decision trees, and cubes. A data mining query language (DMQL) can be designed to
incorporate these primitives, allowing users to flexibly interact with data mining systems.
4. Data & Data Types
In data mining, data means the information (i.e., the data repositories) that is given to the system.
There are different types of data repositories. They are relational databases, data warehouses,
transactional databases, advanced database systems, flat files, data streams, and the World Wide Web.
Advanced database systems include object-relational databases and specific application-oriented
databases, such as spatial databases, time-series databases, text databases, and multimedia databases.
4.1 Relational Databases:
A relational database is a collection of tables, each of which is assigned a unique name. Each table consists of a set of attributes (columns or fields) and usually stores a large set of tuples
(records or rows). Each tuple in a relational table represents an object identified by a unique key and
described by a set of attribute values.
An entity-relationship (ER) data model is used for construction of relational databases. An ER data model
represents the database as a set of entities and their relationships.
4.2 Data Warehouses:
A data warehouse is a repository of information collected from multiple sources, stored under a unified
schema, and that usually resides at a single site.
A data warehouse is usually modeled by a multidimensional database structure (a data cube), where each dimension corresponds to an attribute or a set of attributes in the schema, and each cell stores the value of some aggregate measure, such as count or sales amount.
The actual physical structure of a data warehouse may be a relational data store or a multidimensional
data cube.
4.3 Transactional Databases
A transactional database consists of a file where each record represents a transaction. A transaction typically includes a unique
transaction identity number (trans ID) and a list of the items making up the transaction (such as items
purchased in a store).
4.4 Advanced Databases – Object-Relational Databases
The object-relational data model inherits the essential concepts of object-oriented databases, where each
entity is considered as an object.
Data and code relating to an object are encapsulated into a single unit. Each object has associated with it
the following:
A set of variables that describe the objects. These correspond to attributes in the entity-relationship
and relational models.
A set of messages that the object can use to communicate with other objects, or with the rest of the
database system.
A set of methods, where each method holds the code to implement a message.
Example
an employee class can contain variables like name, address, and birthdate. The method for the message
get_photo(employee) will retrieve and return a photo of the given employee object.
4.5 Application Databases
A temporal database stores relational data that include time-related attributes. These attributes may
involve several timestamps, each having different semantics.
A sequence database stores sequences of ordered events, with or without a concrete notion of time.
Examples include customer shopping sequences, Web click streams, and biological sequences.
A time-series database stores sequences of values or events obtained over repeated measurements of
time (e.g., hourly, daily, weekly). Examples include data collected from the stock exchange, inventory
control, and the observation of natural phenomena (like temperature and wind).
Spatial databases contain spatial-related information. Examples include geographic (map) databases,
very large-scale integration (VLSI) or computed-aided design databases, and medical and satellite
image databases.
A spatial database that stores spatial objects that change with time is called a spatiotemporal
database, from which interesting information can be mined. For example, we may be able to group
the trends of moving objects and identify some strangely moving vehicles, or distinguish a bioterrorist
attack from a normal outbreak of the flu based on the geographic spread of a disease with time.
Text databases are databases that contain word descriptions for objects. Example Web Pages, E-Mail
Messages etc.,
Multimedia databases store image, audio, and video data. They are used in applications such as
picture content-based retrieval, voice-mail systems, video-on-demand systems, the World Wide Web,
and speech-based user interfaces that recognize spoken commands.
A heterogeneous database consists of a set of interconnected, autonomous component databases. The
components communicate in order to exchange information and answer queries. Objects in one
component database may differ greatly from objects in other component databases, making it difficult
to assimilate their semantics into the overall heterogeneous database. Example: application of different
hardware and software.
A legacy database is a group of heterogeneous databases that combines different kinds of data
systems, such as relational or object-oriented databases, hierarchical databases, network databases,
spreadsheets, multimedia databases, or file systems. The heterogeneous databases in a legacy database
may be connected by intra or inter-computer networks.
Data Stream, where data flow in and out of an observation platform (or window) dynamically. Such
data streams have the following unique features: huge or possibly infinite volume, dynamically
changing, flowing in and out in a fixed order, allowing only one or a small number of scans, and
demanding fast (often real-time) response time. Typical examples of data streams include various
kinds of scientific and engineering data, time-series data, and data produced in other dynamic
environments, such as power supply, network traffic, stock exchange, telecommunications, Web click
streams, video surveillance, and weather or environment monitoring.
The World Wide Web and its associated distributed information services, such as Yahoo!, Google,
America Online, and AltaVista, provide rich, worldwide, on-line information services, where data
objects are linked together to facilitate interactive access.
5. Data mining functionalities
Data mining functionalities are used to specify the kinds of patterns to be found in data mining tasks. In
general, such tasks can be classified into two categories:
Descriptive
Predictive
Descriptive mining tasks characterize properties of the data in a target data set.
Predictive mining tasks perform induction on the current data in order to make predictions.
Data mining functionalities, and the kinds of patterns they can discover, are described below.
5.1 Class/Concept Description: Characterization and Discrimination
Data can be associated with classes or concepts. It can be useful to describe individual classes and
concepts in summarized, concise, and precise terms. Such descriptions of a class or a concept are called
class/concept descriptions. These descriptions can be derived using:
1. data characterization, by summarizing the data of the class under study (often called the target class),
2. data discrimination, by comparison of the target class with one or a set of comparative classes (often
called the contrasting classes),
3. both data characterization and discrimination.
For example, to study the characteristics of software products whose sales increased by 10% in the
last year, the data related to such products can be collected by executing an SQL query.
The output of data characterization can be presented in various forms. Examples include pie charts,
bar charts, curves, multidimensional data cubes, and multidimensional tables, including crosstabs.
The resulting descriptions can also be presented as generalized relations or in rule form (called
characteristic rules).
Data discrimination is a comparison of the general features of target class data objects with the general
features of objects from one or a set of contrasting classes.
For example, the user may like to compare the general features of software products whose sales
increased by 10% in the last year with those whose sales decreased by at least 30% during the same
period.
The output of data discrimination can be presented in various forms. Examples include pie charts,
bar charts, curves, multidimensional data cubes, and multidimensional tables, including crosstabs.
Discrimination descriptions expressed in rule form are referred to as discriminant rules.
5.2 Mining Frequent Patterns, Associations, and Correlations
Frequent patterns
Frequent patterns are patterns that occur frequently in data. There are many kinds of frequent patterns,
including frequent itemsets, frequent subsequences, and frequent substructures.
A frequent itemset typically refers to a set of items that often appear together in a transactional
data set — for example, milk and bread, which are frequently bought together in grocery stores by many
customers.
A frequently occurring subsequence, such as the pattern that customers tend to purchase first a PC,
followed by a digital camera, and then a memory card, is a (frequent) sequential pattern.
A substructure can refer to different structural forms, such as graphs, trees, or lattices, which may be
combined with itemsets or subsequences. If a substructure occurs frequently, it is called a (frequent)
structured pattern.
Association analysis
Association analysis is the discovery of association rules showing attribute-value conditions that
occur frequently together in a given set of data. Association analysis is widely used for market basket or
transaction data analysis.
Example:-
Suppose that, as a marketing manager at AllElectronics, you want to know which items are frequently
purchased together (i.e., within the same transaction). An example of such a rule is shown below:
buys(X, "computer") => buys(X, "software") [support = 1%, confidence = 50%]
where X is a variable representing a customer.
A confidence, or certainty, of 50% means that if a customer buys a computer, there is a 50% chance that
she will buy software as well. A 1% support means that 1% of all the transactions under analysis show that
computer and software are purchased together. This association rule involves a single attribute or predicate
(i.e., buys) that repeats. These are referred to as single-dimensional association rules.
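As an illustration, support and confidence for such a rule can be computed directly from transaction counts. The Python sketch below uses a small made-up transaction list, so the resulting percentages differ from the 1%/50% example above.

```python
# Minimal sketch: computing support and confidence for the rule
# buys(computer) => buys(software) from a hypothetical transaction list.
transactions = [
    {"computer", "software"},
    {"computer"},
    {"software", "printer"},
    {"computer", "software", "printer"},
    {"printer"},
]

n = len(transactions)
both = sum(1 for t in transactions if {"computer", "software"} <= t)
computer = sum(1 for t in transactions if "computer" in t)

support = both / n              # fraction of all transactions containing both items
confidence = both / computer    # fraction of computer buyers who also buy software

print(f"support = {support:.0%}, confidence = {confidence:.0%}")
```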
5.3 Classification and Regression (for Predictive Analysis)
Classification is the process of finding a model (or function) that describes and distinguishes data
classes or concepts. The model is derived based on the analysis of a set of training data (i.e., data objects
for which the class labels are known). The model is used to predict the class label of objects for which the
class label is unknown. The derived model may be represented in various forms, such as classification (IF-
THEN) rules, decision trees, mathematical formulae, or neural networks.
A decision tree is a flowchart-like tree structure. Each node denotes a test on an attribute value, each branch
represents an outcome of the test, and tree leaves represent classes or class distributions. Decision trees can
be easily converted to classification rules.
A neural network is a collection of linear threshold units that can be trained to distinguish objects of
different classes.
Prediction - It is used to predict missing or unavailable numerical data values rather than class labels.
Example: Regression analysis
5.4 Cluster Analysis
Clustering analyzes data objects without consulting a known class label. The objects are clustered or
grouped based on the principle of maximizing the intra-class similarity and minimizing the inter-class
similarity. That is, clusters of objects are formed so that objects within a cluster have high similarity in
comparison to one another, but are very dissimilar to objects in other clusters. Each cluster that is formed can
be viewed as a class of objects, from which rules can be derived.
5.5 Outlier Analysis
A data set may contain objects that do not comply with the general behavior or model of the data. These data objects are outliers.
Many data mining methods discard outliers as noise or exceptions. However, in some applications (e.g.,
fraud detection) the rare events can be more interesting than the more regularly occurring ones. The analysis
of outlier data is referred to as outlier analysis or anomaly mining.
Outliers may be detected using statistical tests that assume a distribution or probability model for the
data, or using distance measures where objects that are remote from any other cluster are considered outliers.
Structure Analysis
– Graph mining is finding frequent subgraphs (e.g., chemical compounds), trees (XML),
substructures (web fragments)
– Social networks: actors (objects, nodes) and relationships (edges). e.g., author networks in CS,
terrorist networks
– Web community discovery, opinion mining, usage mining
6. Interestingness of Patterns
A data mining system has the potential to generate thousands or even millions of patterns, or rules.
However, a pattern is interesting only if it is
1. potentially useful
2. new (previously unknown)
3. easily understood by humans
4. valid on new or test data with some degree of certainty
A pattern is also interesting if it validates a hypothesis that the user sought to confirm. An interesting
pattern represents knowledge. Several objective measures of pattern interestingness exist. These are based on
the structure of discovered patterns and the statistics underlying them. An objective measure for association
rules of the form X =>Y is rule support, representing the percentage of data samples that the given rule satisfies.
Another objective measure for association rules is confidence, which assesses the degree of certainty of the
detected association. It is defined as the conditional probability that a pattern Y is true given that X is true.
Over 20 interestingness measures have been proposed. Transactions that contain neither of the items A nor B are called null transactions. Null-invariance (insensitivity to the number of null transactions) is crucial for correlation analysis. Support and confidence alone are not good enough to indicate correlations; null-invariant measures are better, for example:
– Kulczynski Measure
– IR (Imbalance Ratio): measures the imbalance of two itemsets A and B in rule implications
7. A Classification of Data Mining Systems
Data mining is an interdisciplinary field, the confluence of a set of disciplines including database systems, statistics, machine learning, visualization, and information science.
7.1 Classification according to the kinds of databases mined
Database systems themselves can be classified according to different criteria (such as data models, or the
types of data or applications involved), each of which may require its own data mining technique. Data mining
systems can therefore be classified accordingly.
For instance, if classifying according to data models, we may have a relational, transactional, object-
oriented, object-relational, or data warehouse mining system. If classifying according to the special types of
data handled, we may have spatial, time-series, text, or multimedia data mining system, or a World-Wide
Web mining system. Other system types include heterogeneous data mining systems, and legacy data mining
systems.
7.2 Classification according to the kinds of knowledge mined
Data mining systems can be categorized according to the kinds of knowledge they mine, i.e., based on data
mining functionalities, such as characterization, discrimination, association, classification, clustering,
trend and evolution analysis, deviation analysis, similarity analysis, etc.
A comprehensive data mining system usually provides multiple and/or integrated data mining functionalities.
7.3 Classification based on level of abstraction
Data mining systems can also be distinguished based on the granularity or levels of abstraction of the knowledge mined, including generalized knowledge (at a high level of abstraction),
primitive-level knowledge (at a raw data level), or knowledge at multiple levels (considering several levels of
abstraction). An advanced data mining system should facilitate the discovery of knowledge at multiple levels of
abstraction.
7.4 Classification according to the kinds of techniques utilized
Data mining systems can also be categorized according to the underlying data mining techniques
employed. These techniques can be described according to the degree of user interaction involved (e.g.,
autonomous systems, interactive exploratory systems, query-driven systems), or the methods of data analysis
employed (e.g., database-oriented or data warehouse-oriented techniques, machine learning, statistics,
visualization, pattern recognition, neural networks, and so on). A sophisticated data mining system will often
adopt multiple data mining techniques or work out an effective, integrated technique which combines the
merits of a few individual approaches.
7.5 Classification based on data regularities
Data mining systems can also be classified based on whether they mine data regularities (commonly occurring patterns) or data irregularities (such as exceptions or outliers).
7.6 Classification according to the application adapted
Finally, data mining systems can be categorized according to the application they are adapted to, such as finance, telecommunications, DNA analysis, stock markets, e-mail, and so on.
8. Major Issues in Data Mining
Data mining is a vigorously developing research area. The major issues in data mining can be grouped under the following categories:
1. Mining Methodology
2. User Interaction
3. Efficiency and Scalability
4. Diversity of data types
5. Data mining and society
8.1 Mining Methodology
Mining methodologies should consider issues such as data uncertainty, noise, and incompleteness. Some mining methods explore how user-specified measures can be used to assess the interestingness of discovered patterns as well as to guide the discovery process.
8.2 User Interaction
The user plays an important role in the data mining process. Interesting areas of research include how to interact with a data mining system, how to incorporate a user's background knowledge in mining, and how to visualize and comprehend data mining results. The following are the broad areas that come under this topic:
Interactive mining
It is important to build flexible user interfaces and an exploratory mining environment, facilitating the
user’s interaction with the system. A user may like to first sample a set of data, explore general
characteristics of the data, and estimate potential mining results. Interactive mining should allow users
to dynamically change the focus of a search, to refine mining requests based on returned results, and to
drill, dice, and pivot through the data and knowledge space interactively, dynamically exploring “cube
space” while mining.
Incorporation of background knowledge
Background knowledge, constraints, rules, and other information regarding the domain under study
should be incorporated into the knowledge discovery process. Such knowledge can be used for pattern
evaluation as well as to guide the search toward interesting patterns.
Ad hoc data mining and data mining query languages: Query languages (e.g., SQL) have played an important role in flexible searching; data mining query languages aim to give users similar flexibility in specifying ad hoc mining tasks.
Presentation and visualization of data mining results
This requires the system to adopt expressive knowledge representations, user-friendly interfaces, and
visualization techniques.
8.4 Diversity of Data Types
The following are the broad areas that come under this topic:
Handling complex types of data
Diverse applications generate a wide spectrum of data types - temporal data, biological sequences,
sensor data, spatial data, hypertext data, multimedia data, software program code, web data, social
network data etc. Data can be structured/ unstructured; static or dynamic. Domain or application
specific data mining systems are being constructed for in-depth mining of specific kinds of data.
Mining dynamic, networked, and global data repositories
Multiple sources of data are connected by the Internet and various kinds of networks, forming gigantic,
distributed, and heterogeneous global information systems and networks. The discovery of knowledge
from different data sources with diverse data semantics poses great challenges to data mining.
9. Data Preprocessing
Data have quality if they satisfy the requirements of the intended use. There are many factors comprising data quality, including accuracy, completeness, consistency, timeliness, believability, and interpretability. Data preprocessing helps to achieve data quality. The major steps involved in data preprocessing are:
1. data cleaning
2. data integration
3. data reduction
4. data transformation
For example, in data transformation the values -2, 32, 100, 59, 48 may be normalized to -0.02, 0.32, 1.00, 0.59, 0.48.
9.1 Data Cleaning
9.1.1 Missing values
Missing values can be handled in the following ways:
Ignore the tuple: this is usually done when the tuple contains several attributes with missing values.
Fill in the missing value manually: In general, this approach is time-consuming and may not be
feasible given a large data set with many missing values.
Use a global constant to fill in the missing value: Replace all missing attribute values by the same
constant, such as a label like “Unknown". Although this method is simple, it is not recommended.
Use the attribute mean to fill in the missing value: For example, suppose that the average income of
All Electronics customers is $28,000. Use this value to replace the missing value for income.
Use the attribute mean of all samples belonging to the same class as the given tuple: For example, if
classifying customers according to credit risk, replace the missing value with the average income value
for customers in the same credit risk category as that of the given tuple.
Use the most probable value to fill in the missing value: This may be determined with inference-
based tools using a Bayesian formalism or decision tree induction. For example, using the other
customer attributes in your data set, you may construct a decision tree to predict the missing values for
income.
9.1.2 Noisy data
Noise is a random error or variance in a measured variable. The following techniques can be used to smooth noisy data:
i. Binning methods:
Binning methods attempt to smooth sorted data by consulting the “neighborhood" or values around it.
Because binning methods consult the neighborhood of values, they perform local smoothing. Figure
illustrates some binning techniques. Smoothing is achieved by replacing a set of neighborhood values by
mean, median, boundary values, mathematical function etc.
Example:-
Price data are first sorted and partitioned into equi-depth bins of depth 3. In smoothing by bin means, each
value in a bin is replaced by the mean value of the bin. The mean of the values 4, 8, and 15 in Bin 1 is 9.
Therefore, each original value in this bin is replaced by the value 9. Similarly, each bin value may be
replaced by the bin median. In smoothing by bin boundaries, the minimum and maximum values in a
given bin are identified as the bin boundaries. Each bin value is then replaced by the closest boundary
value.
Input data: 4, 24, 25, 21, 34, 21, 15, 8, 28
Sorted input: 4, 8, 15, 21, 21, 24, 25, 28, 34
Partition into equi-depth bins of depth 3:
  Bin 1: 4, 8, 15    (mean 9,  min 4,  max 15)  -> by bin means: 9, 9, 9     -> by bin boundaries: 4, 4, 15
  Bin 2: 21, 21, 24  (mean 22, min 21, max 24)  -> by bin means: 22, 22, 22  -> by bin boundaries: 21, 21, 24
  Bin 3: 25, 28, 34  (mean 29, min 25, max 34)  -> by bin means: 29, 29, 29  -> by bin boundaries: 25, 25, 34
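The same equi-depth binning and smoothing steps can be sketched in Python; the reshape below works only because the number of values is a multiple of the bin depth.

```python
import numpy as np

# Sketch of equi-depth binning with smoothing by bin means and by bin boundaries,
# reproducing the worked example above (depth = 3).
data = np.sort(np.array([4, 24, 25, 21, 34, 21, 15, 8, 28]))
depth = 3
bins = data.reshape(-1, depth)          # works here because len(data) is a multiple of depth

by_means = np.repeat(bins.mean(axis=1), depth).reshape(-1, depth)
# Smoothing by boundaries: replace each value with the closer of the bin's min/max.
lo, hi = bins.min(axis=1, keepdims=True), bins.max(axis=1, keepdims=True)
by_bounds = np.where(bins - lo <= hi - bins, lo, hi)

print(bins)       # [[ 4  8 15] [21 21 24] [25 28 34]]
print(by_means)   # [[ 9.  9.  9.] [22. 22. 22.] [29. 29. 29.]]
print(by_bounds)  # [[ 4  4 15] [21 21 24] [25 25 34]]
```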
ii. Clustering
Similar values are organized into groups or “clusters”. Values which fall outside of the set of clusters
may be considered outliers. The following figure shows outliers detected from cluster analysis. Outliers may
be identified through a combination of computer and human inspection.
iii. Regression
Data can be smoothed by fitting the data to a function, such as regression. Linear regression involves
finding the “best" line to fit two variables, so that one variable can be used to predict the other.
Multiple linear regression is an extension of linear regression, where more than two variables are involved
and the data are fit to a multidimensional space.
iv. Combined computer and human inspection: detect suspicious values and have them checked by a human (e.g., to deal with possible outliers).
9.2 Data Integration
Data may arrive from multiple sources in your analysis. This would involve integrating multiple
databases, data cubes, or files.
The data analysis task may involve data integration, which combines data from multiple sources into a
coherent data store, as in data warehousing. These sources may include multiple databases, data cubes, or flat
files.
9.2.1 Entity Identification Problem
Schema integration and object matching pose issues. These include how to match attribute names,
meaning, data type, size, and range of permitted values. For example, customer identification may be referred
to as customer id in one data store and cust id in another. A name could be registered as “Bill” in one
database, “William” in another, and “B.” in a third.
9.2.2 Redundancy and Correlation Analysis
Redundancy is another important issue in data integration. An attribute may be redundant if it can be
“derived” from another attribute or set of attributes.
E.g. annual revenue can be derived from monthly revenue. Inconsistencies in attribute or dimension
naming can also cause redundancies.
Some redundancies can be detected by correlation analysis. Given two attributes or variables, such
analysis can measure how strongly one attribute implies the other, based on the available data. For nominal
data, we use the chi-square test. For numeric attributes, we can use the correlation coefficient (ρ), which varies from -1 to +1. A positive value indicates that the two variables are positively correlated, i.e., they rise and fall together. If ρ is zero, there is no correlation. A negative ρ implies that a rise in one variable is accompanied by a fall in the other.
Simple correlation coefficient (for numeric attributes A and B):
ρ(A, B) = Σ (a_i - Ā)(b_i - B̄) / (n · σ_A · σ_B)
where n is the number of tuples, a_i and b_i are the values of A and B in tuple i, Ā and B̄ are the means of A and B, and σ_A and σ_B are their standard deviations.
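The coefficient can be computed directly, as in the sketch below; the monthly and annual revenue figures are made up purely to illustrate detecting a derivable (redundant) attribute.

```python
import numpy as np

# Sketch: Pearson correlation coefficient between two numeric attributes.
monthly = np.array([10.0, 12.0, 9.0, 15.0, 11.0])
annual = 12 * monthly + np.random.normal(0, 1.0, size=monthly.shape)

rho = np.corrcoef(monthly, annual)[0, 1]
print(f"correlation coefficient rho = {rho:.3f}")   # close to +1 => strongly correlated

# A |rho| near 1 suggests one attribute is largely derivable from the other,
# so keeping both in the integrated data set would be redundant.
```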
9.2.3 Tuple Duplication
Duplication means there are two or more identical tuples in the dataset for a given case. Duplicates
appear due to inaccurate data entry or update, or use of de-normalized tables.
9.2.4 Data Value Conflict Detection and Resolution
Data integration also involves the detection and resolution of data value conflicts. For example, for the
same real-world entity, attribute values from different sources may differ. This may be due to differences in
representation, scaling, or encoding. For instance, a weight attribute may be stored in metric units in one
system and British imperial units in another. Attributes may also differ on the level of abstraction.
9.3 Data Reduction
The data in a data warehouse can be very large. Complex data analysis and mining on huge amounts of data can take a long time, making such analysis impractical or infeasible. Data reduction techniques can be applied to obtain a reduced representation that is much smaller in volume yet maintains the characteristics of the original data. Data reduction strategies include the following:
1. Data cube aggregation, where aggregation operations are applied to the data in the construction of
a data cube.
2. Attribute subset selection, where irrelevant, weakly relevant, or redundant attributes or dimensions
may be detected and removed.
3. Dimensionality reduction, where encoding mechanisms are used to reduce the data set size.
4. Numerosity reduction, where the data are replaced or estimated by alternative, smaller data
representations such as parametric models (which need store only the model parameters instead of the
actual data) or nonparametric methods such as clustering, sampling, and the use of histograms.
5. Discretization and concept hierarchy generation, where raw data values for attributes are
replaced by ranges or higher conceptual levels. Data discretization is a form of numerosity reduction
that is very useful for the automatic generation of concept hierarchies.
Dimensionality reduction is the process of reducing the number of variables under consideration.
Dimensionality reduction methods include:-
wavelet transforms
principal components analysis
They transform or project the original data onto a smaller space. Attribute subset selection is a method of
dimensionality reduction in which irrelevant, weakly relevant or redundant attributes or dimensions are
detected and removed.
9.3.1 Wavelet transforms
The discrete wavelet transform (DWT) is a linear signal processing technique that, when applied to
a data vector D, transforms it to a numerically different vector, D’, of wavelet coefficients. The two vectors
are of the same length. A compressed approximation of the data can be retained by storing only those wavelet coefficients that are larger than some threshold and discarding the others.
The DWT is closely related to the discrete Fourier transform (DFT), a signal processing technique involving
sines and cosines. In general, however, the DWT achieves better lossy compression.
The length, L, of the input data vector must be an integer power of two. This condition can be met by
padding the data vector with zeros, as necessary.
Each transform involves applying two functions. The first applies some data smoothing, such as a sum
or weighted average. The second performs a weighted difference.
The two functions are applied to pairs of the input data, resulting in two sets of data of length L/2. In
general, these respectively represent a smoothed version of the input data, and the high-frequency
content of it.
The two functions are recursively applied to the sets of data obtained in the previous loop, until the
resulting data sets obtained are of desired length.
A selection of values from the data sets obtained in the above iterations are designated the wavelet
coefficients of the transformed data.
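The smoothing/difference idea can be illustrated with one level of a Haar-style transform. This is a simplified sketch (pairwise averages and differences), not a full DWT implementation.

```python
import numpy as np

# Minimal sketch of one level of a Haar-style discrete wavelet transform:
# pairwise averages give the smoothed half, pairwise differences the detail half.
# (Assumes the input length is a power of two; pad with zeros otherwise.)
def haar_step(x):
    x = np.asarray(x, dtype=float)
    pairs = x.reshape(-1, 2)
    smooth = pairs.mean(axis=1)                 # low-frequency / smoothed version
    detail = (pairs[:, 0] - pairs[:, 1]) / 2.0  # high-frequency content
    return smooth, detail

data = [2, 2, 0, 2, 3, 5, 4, 4]
smooth, detail = haar_step(data)
print(smooth)   # [2. 1. 4. 4.]
print(detail)   # [ 0. -1. -1.  0.]
# Recursing on `smooth` and keeping only the large coefficients yields a
# compressed, lossy representation of the original vector.
```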
9.3.2 Principal components analysis
Principal components analysis (PCA) searches for c orthogonal vectors that can best be used to represent the data, where c is much smaller than the number of original attributes, N. The original data are thus projected onto a much smaller space, resulting in data compression. PCA can be used as a form of dimensionality reduction: the initial data are projected onto this smaller set of vectors.
PCA can be applied to ordered and unordered attributes, and can handle sparse data and skewed data.
Multidimensional data can be handled by reducing the problem to fewer dimensions. Principal components
may be used as inputs to multiple regression and cluster analysis. In comparison with wavelet transforms,
PCA tends to be better at handling sparse data, whereas wavelet transforms are more suitable for data of
higher dimensionality.
The input data are normalized, so that each attribute falls within the same range. This step helps ensure
that attributes with large domains will not dominate attributes with smaller domains.
PCA computes ‘N’ orthonormal vectors which provide a basis for the normalized input data. These
are unit vectors that each point in a direction perpendicular to the others. These vectors are referred to
as the principal components. The input data are a linear combination of the principal components.
The principal components are sorted in order of decreasing “significance" or strength. The principal
components essentially serve as a new set of axes for the data, providing important information about
variance.
Since the components are sorted according to decreasing order of “significance", the size of the
data can be reduced by eliminating the weaker components, i.e., those with low variance. Using the
strongest principal components, it should be possible to reconstruct a good approximation of the
original data.
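A minimal NumPy sketch of these steps, using randomly generated data in place of a real data set:

```python
import numpy as np

# Sketch of PCA for dimensionality reduction:
# normalize, compute the covariance matrix, keep the strongest components.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                     # 100 tuples, N = 5 attributes (made-up data)

X_norm = (X - X.mean(axis=0)) / X.std(axis=0)     # step 1: normalize each attribute
cov = np.cov(X_norm, rowvar=False)                # covariance between attributes
eigvals, eigvecs = np.linalg.eigh(cov)            # orthonormal principal components

order = np.argsort(eigvals)[::-1]                 # sort by decreasing "significance"
c = 2                                             # keep the c strongest components
components = eigvecs[:, order[:c]]

X_reduced = X_norm @ components                   # project onto the smaller space
print(X_reduced.shape)                            # (100, 2)
```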
9.3.3 Attribute Subset Selection
Data sets for analysis may contain hundreds of attributes, many of which may be irrelevant to the mining
task or redundant. For example, if the task is to classify customers based on whether or not they are likely to
purchase a popular new CD, attributes such as the customer’s telephone number are likely to be irrelevant,
unlike attributes such as age or music-taste. Although it may be possible for a domain expert to pick out
some of the useful attributes, this can be a difficult and time consuming task, especially when the data’s
behavior is not well known.
The goal of attribute subset selection is to find a minimum set of attributes such that the resulting
probability distribution of the data classes is as close as possible to the original one using all attributes.
For n attributes, there are 2^n possible subsets. An exhaustive search for the optimal subset of attributes
can be prohibitively expensive, especially when n is large. Greedy algorithms are used in this context.
Their strategy is to make a locally optimal choice in the hope that this will lead to a globally optimal solution.
Stepwise forward selection:
The procedure starts with an empty set of attributes as the reduced set. The best of the original
attributes is determined and added to the reduced set. At each subsequent iteration or step, the best of
the remaining original attributes is added to the set.
Stepwise backward elimination:
The procedure starts with the full set of attributes. At each step, it removes the worst attribute
remaining in the set.
– Decision tree algorithms (e.g., ID3, C4.5, and CART) were originally intended for
classification. Decision tree induction constructs a flowchart-like structure where each internal
(nonleaf) node denotes a test on an attribute, each branch corresponds to an outcome of the test,
and each external (leaf) node denotes a class prediction. At each node, the algorithm chooses
the “best” attribute to partition the data into individual classes.
– When decision tree induction is used for attribute subset selection, a tree is constructed from
the given data. All attributes that do not appear in the tree are assumed to be irrelevant. The set
of attributes appearing in the tree form the reduced subset of attributes.
The stopping criteria for the above methods may vary. The procedure may employ a threshold on the
measure used to determine when to stop the attribute selection process.
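A sketch of stepwise forward selection is shown below; the data set (iris) and the evaluation measure (a cross-validated decision tree score) are illustrative choices, not prescribed by the text.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Greedy forward selection: repeatedly add the attribute that most improves
# the score, and stop when no remaining attribute helps (the stopping threshold).
X, y = load_iris(return_X_y=True)
remaining, selected = list(range(X.shape[1])), []
best_score = 0.0

while remaining:
    scores = [(cross_val_score(DecisionTreeClassifier(random_state=0),
                               X[:, selected + [a]], y, cv=5).mean(), a)
              for a in remaining]
    score, attr = max(scores)
    if score <= best_score:          # stopping criterion
        break
    best_score = score
    selected.append(attr)
    remaining.remove(attr)

print("selected attribute indices:", selected, "score:", round(best_score, 3))
```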
9.3.4 Regression and Log-Linear Models: Parametric Data Reduction
Numerosity reduction techniques replace the original data volume by alternative, smaller forms of
data representation. These techniques may be parametric or nonparametric.
For parametric methods, a model is used to estimate the data, so that typically only the data parameters need to be stored, instead of the actual data. Examples of parametric methods:
– Regression
– Log-linear models
Nonparametric methods store reduced representations of the data. Examples:
– Histograms
– Clustering
– Sampling
– Data cube aggregation
Regression and log-linear models can be used to approximate the given data.
In linear regression, the data are modeled to fit a straight line. For example, a random variable, Y (called a response variable), can be modeled as a linear function of another random variable, X (called a predictor variable), with the equation
Y = wX + b
where the variance of Y is assumed to be constant.
The coefficients w and b (called regression coefficients) specify the slope of the line and the y-intercept, respectively. These coefficients can be solved for by the method of least squares, which minimizes the error between the actual data and the estimate of the line.
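A short sketch of parametric reduction by linear regression: only w and b are kept, and approximate values are reconstructed from them on demand. The data values here are made up.

```python
import numpy as np

# Fit Y = wX + b by least squares, so only the two coefficients
# (rather than the raw data) need to be stored.
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 3.9, 6.2, 8.0, 9.9])

w, b = np.polyfit(X, Y, deg=1)          # least-squares line (slope, intercept)
print(f"Y ~ {w:.2f} * X + {b:.2f}")     # slope close to 2, intercept close to 0

Y_est = w * X + b                       # reconstruct approximate values on demand
```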
Log-linear models approximate discrete multidimensional probability distributions. The method can be
used to estimate the probability of each cell in a base cuboid for a set of discretized attributes, based on the
smaller cuboids making up the data cube lattice.
9.3.5 Histograms
A histogram for an attribute A partitions the data distribution of A into disjoint subsets, or buckets. The
buckets are displayed on a horizontal axis, while the height (and area) of a bucket represents the frequency.
Equi-width: In an equi-width histogram, the width of each bucket range is constant (e.g., a width of $10 per bucket).
Equi-depth (or equi-height): In an equi-depth histogram, the buckets are created so that, roughly, the
frequency of each bucket is constant (that is, each bucket contains roughly the same number of
contiguous data samples).
V-Optimal: If we consider all of the possible histograms for a given number of buckets, the V-optimal
histogram is the one with the least variance. Histogram variance is a weighted sum of the original
values that each bucket represents, where bucket weight is equal to the number of values in the bucket.
MaxDiff: In a MaxDiff histogram, we consider the difference between each pair of adjacent values. A
bucket boundary is established between each pair for pairs having the largest differences, as specified
by the user
Histogram Example: The following data are a list of AllElectronics prices for commonly sold items
(rounded to the nearest dollar).
1, 8, 15, 5, 21, 20, 5, 20, 1, 5, 15, 21, 15, 20, 14, 18, 30, 14, 25, 12, 5, 10, 30, 20, 25, 15, 15, 18, 18, 10, 20,
21, 25, 18, 25, 28, 28, 20, 10, 5, 30, 18, 18, 10, 18, 25, 8, 21, 15, 20, 14, 18
The numbers are sorted: 1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15, 15, 15, 15, 15, 18, 18,
18, 18, 18, 18, 18, 18, 20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30.
The corresponding histograms (three buckets each):
Equal-width histogram (bucket width 10): 1–10: 13 values; 11–20: 25 values; 21–30: 14 values.
Equal-depth histogram (buckets hold roughly equal numbers of values): 1–14: 17 values; 15–20: 21 values; 21–30: 14 values.
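The same buckets can be produced programmatically; in the sketch below the equal-depth boundaries are taken from quantiles, so the bucket edges differ slightly from the hand-built example.

```python
import numpy as np

# Equal-width and (approximate) equal-depth histograms for the price list above.
prices = np.array([1, 8, 15, 5, 21, 20, 5, 20, 1, 5, 15, 21, 15, 20, 14, 18, 30,
                   14, 25, 12, 5, 10, 30, 20, 25, 15, 15, 18, 18, 10, 20, 21, 25,
                   18, 25, 28, 28, 20, 10, 5, 30, 18, 18, 10, 18, 25, 8, 21, 15,
                   20, 14, 18])

# Equal-width: three buckets of width 10.
counts, edges = np.histogram(prices, bins=[0.5, 10.5, 20.5, 30.5])
print(counts)                      # [13 25 14]

# Equal-depth: bucket boundaries taken from quantiles, so each bucket holds
# roughly the same number of values.
edges_depth = np.quantile(prices, [0, 1/3, 2/3, 1])
print(edges_depth)
print(np.histogram(prices, bins=edges_depth)[0])
```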
9.3.6 Clustering
Clustering techniques consider data tuples as objects. They partition the objects into groups or
clusters, so that objects within a cluster are “similar" to one another and “dissimilar" to objects in
other clusters. Similarity is commonly defined in terms of how “close" the objects are in space, based on a
distance function. The quality of a cluster may be represented by its diameter, the maximum distance between
any two objects in the cluster. Centroid distance is an alternative measure of cluster quality, and is defined as
the average distance of each cluster object from the cluster centroid.
9.3.7 Sampling
Sampling can be used as a data reduction technique since it allows a large data set to be represented by a
much smaller random sample (or subset) of the data. Suppose that a large data set, D, contains N tuples. Let's
have a look at some possible samples for D.
1. Simple random sample without replacement of size n: This is created by drawing n of the N tuples from D (n < N), where the probability of drawing any tuple in D is 1/N, i.e., all tuples are equally likely.
2. Simple random sample with replacement of size n: This is similar to simple random sample, except that
each time a tuple is drawn from D, it is recorded and then replaced. That is, after a tuple is drawn, it is
placed back in D so that it may be drawn again.
3. Cluster sample: If the tuples in D are grouped into M mutually disjoint “clusters", then a simple random
sample of m clusters can be obtained, where m < M.
4. Stratified sample: If D is divided into mutually disjoint parts called "strata", a stratified sample of D is generated by obtaining a simple random sample from each stratum. This helps to ensure a representative sample, especially when the data are skewed. For example, a stratified sample may be obtained from customer data, where a stratum is created for each customer age group.
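The four schemes can be sketched on a synthetic data set of N = 1000 tuples; the age-group stratum labels below are made up for the illustration.

```python
import numpy as np

# Sketch of the four sampling schemes on a made-up data set D of N tuples.
rng = np.random.default_rng(42)
N, n = 1000, 50
D = np.arange(N)                                   # stand-in for N tuples
age_group = rng.integers(0, 4, size=N)             # hypothetical stratum label per tuple

srswor = rng.choice(D, size=n, replace=False)      # simple random sample w/o replacement
srswr = rng.choice(D, size=n, replace=True)        # simple random sample w/ replacement

clusters = np.array_split(D, 20)                   # M = 20 disjoint "clusters"
picked = rng.choice(len(clusters), size=4, replace=False)
cluster_sample = np.concatenate([clusters[i] for i in picked])

# Stratified sample: a simple random sample from each age-group stratum.
stratified = np.concatenate([
    rng.choice(D[age_group == g], size=n // 4, replace=False)
    for g in np.unique(age_group)
])
print(len(srswor), len(srswr), len(cluster_sample), len(stratified))
```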
9.3.8 Data Cube Aggregation
The lowest level of a data cube (the base cuboid) represents the aggregated data for an individual entity of interest.
[Figure: lattice of cuboids for the dimensions product, date, and country: the 0-D (apex) cuboid "all"; the 1-D cuboids product, date, and country; and the 2-D cuboids (product, date), (product, country), and (date, country).]
Multiple levels of aggregation in data cubes further reduce the size of the data to deal with. Reference the appropriate level and use the smallest representation that is sufficient to solve the task. Queries regarding aggregated information should be answered using the data cube when possible.
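As a rough illustration of cuboid-style aggregation, here is a pandas sketch over made-up sales records (not a real data cube engine):

```python
import pandas as pd

# Aggregating detailed sales records up the dimension lattice, as a data cube would.
sales = pd.DataFrame({
    "product": ["TV", "TV", "PC", "PC", "TV", "PC"],
    "country": ["US", "US", "US", "IN", "IN", "IN"],
    "year":    [2023, 2024, 2023, 2023, 2024, 2024],
    "amount":  [100, 120, 200, 150, 90, 210],
})

# 2-D cuboid (product, year): one aggregate cell per combination
cube_2d = sales.groupby(["product", "year"])["amount"].sum()
# 1-D cuboid (product) and the 0-D apex cuboid ("all")
cube_1d = sales.groupby("product")["amount"].sum()
apex = sales["amount"].sum()
print(cube_2d, cube_1d, apex, sep="\n")
```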
9.4 Data Transformation and Data Discretization
In data transformation, the data are transformed or consolidated into forms appropriate for mining.
Strategies for data transformation include the following:
1. Smoothing - this works to remove noise from the data. Techniques include binning, regression, and
clustering.
2. Attribute construction (or feature construction), where new attributes are constructed and added from the
given set of attributes to help the mining process.
3. Aggregation, where summary or aggregation operations are applied to the data. For example, the daily sales
data may be aggregated so as to compute monthly and annual total amounts. This step is typically used in
constructing a data cube for data analysis at multiple abstraction levels.
4. Normalization, where the attribute data are scaled so as to fall within a smaller range, such as -1.0 to 1.0, or 0.0 to 1.0.
5. Discretization, where the raw values of a numeric attribute (e.g., age) are replaced by interval labels (e.g.,
0–10, 11–20, etc.) or conceptual labels (e.g., youth, adult, senior). The labels, in turn, can be recursively
organized into higher-level concepts, resulting in a concept hierarchy for the numeric attribute.
6. Concept hierarchy generation for nominal data, where attributes such as street can be generalized to
higher-level concepts, like city or country. Many hierarchies for nominal attributes are implicit within the
database schema and can be automatically defined at the schema definition level.
9.4.1 Normalization
Here the attribute data are scaled so as to fall within a small specified range, such as -1.0 to 1.0, or 0 to
1.0. There are three main methods for data normalization:-
– min-max normalization
– z- score normalization
– decimal scaling.
Min-max normalization performs a linear transformation on the original data. Suppose that min_A and max_A are the minimum and maximum values of an attribute A. Min-max normalization maps a value v of A to v' in the range [new_min_A, new_max_A] by computing
v' = ((v - min_A) / (max_A - min_A)) × (new_max_A - new_min_A) + new_min_A
e.g., (8, 10, 15, 20) transforms to (0, 0.17, 0.58, 1), where new_min_A = 0 and new_max_A = 1.
In z-score normalization (or zero-mean normalization), the values for an attribute A are normalized based on the mean and standard deviation of A. This method of normalization is useful when the actual minimum and maximum of attribute A are unknown, or when there are outliers that dominate the min-max normalization. A value v of A is normalized to v' by computing
v' = (v - Ā) / σ_A
where Ā and σ_A are the mean and standard deviation of A.
Example: let the mean μ = 54,000 and the standard deviation σ = 16,000 for the attribute income. A value of 73,600 maps to (73,600 - 54,000) / 16,000 = 1.225.
Normalization by decimal scaling normalizes by moving the decimal point of values of attribute A. The number of decimal points moved depends on the maximum absolute value of A. A value v of A is normalized to v' by computing
v' = v / 10^j
where j is the smallest integer such that max(|v'|) < 1.
Example: suppose that the recorded values of A range from -986 to 917. The maximum absolute value of A is 986. To normalize by decimal scaling, we therefore divide each value by 1000 (i.e., j = 3), so that -986 normalizes to -0.986 and 917 normalizes to 0.917.
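The three normalization methods applied to the small examples above, as a NumPy sketch:

```python
import numpy as np

# Min-max normalization to [0, 1]
v = np.array([8.0, 10.0, 15.0, 20.0])
new_min, new_max = 0.0, 1.0
minmax = (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min
print(minmax.round(2))            # [0.   0.17 0.58 1.  ]

# z-score normalization (using the stated mean and standard deviation)
print((73600 - 54000) / 16000)    # 1.225

# Decimal scaling: divide by 10^j so the largest absolute value is below 1
a = np.array([-986.0, 917.0])
j = int(np.ceil(np.log10(np.abs(a).max())))
print(a / 10 ** j)                # [-0.986  0.917]
```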
9.4.2 Discretization
Discretization by Binning
Binning is a top-down splitting technique based on a specified number of bins. This technique is
used in ‘Data Cleaning’. These methods are also used as discretization methods for data reduction and
concept hierarchy generation. Binning does not use class information and is therefore an unsupervised
discretization technique. It is sensitive to the user-specified number of bins, as well as the presence of
outliers. Binning techniques were discussed in Section 9.1.2, Noisy data, under binning methods.
Discretization by Histogram Analysis
Like binning, histogram analysis is an unsupervised discretization technique because it does not use class
information. Histograms were introduced in Section 9.3.5, Histograms.
Discretization by Cluster Analysis
Cluster analysis is a popular data discretization method. A clustering algorithm can be applied to
discretize a numeric attribute, A, by partitioning the values of A into clusters or groups. Clustering
takes the distribution of A into consideration, as well as the closeness of data points, and therefore is able to
produce high-quality discretization results. Clustering can be used to generate a concept hierarchy for A by
following either a top-down splitting strategy or a bottom-up merging strategy, where each cluster forms a
node of the concept hierarchy. In the former, each initial cluster or partition may be further decomposed into
several sub clusters, forming a lower level of the hierarchy. In the latter, clusters are formed by repeatedly
grouping neighboring clusters in order to form higher-level concepts.
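A sketch comparing equal-width, equal-depth, and clustering-based discretization of a made-up age attribute, using scikit-learn's KBinsDiscretizer (one possible tool; the text does not prescribe it):

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Discretize a numeric attribute (e.g., age) three ways and compare the bin labels.
age = np.array([13, 15, 16, 19, 20, 21, 25, 30, 33, 35, 40, 45, 52, 70]).reshape(-1, 1)

for strategy in ("uniform", "quantile", "kmeans"):   # equal-width, equal-depth, clustering
    disc = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy=strategy)
    labels = disc.fit_transform(age).ravel().astype(int)
    print(strategy, labels)
```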