Data Mining (DM)
• Data Overload: Every day, countless terabytes and even petabytes of data are generated
from various sources, such as businesses, social networks, scientific experiments, healthcare
records, and global telecommunications. This data explosion, a direct result of widespread
computerization and advances in data storage technologies, makes it challenging to analyze
data effectively using traditional methods.
• Extracting Insights: With so much data available, valuable insights often remain hidden
within it. For instance, retail stores, large corporations, scientific research facilities, and even
government agencies generate huge data volumes. Manually analyzing this data to gain
insights, such as customer preferences or emerging trends, is impractical. Data mining meets
this need by automating the process of discovering hidden patterns, correlations, and useful
information from large datasets, transforming raw data into actionable knowledge.
• Information Age Misconception: People often say we live in an "Information Age," but this is
misleading. We actually live in a "Data Age," where information still needs to be extracted
from raw data to be useful. Despite access to vast amounts of data, most of it remains
unanalyzed and unused.
• Data vs. Knowledge: Google’s Flu Trends is a classic example demonstrating how data mining
can convert raw data into valuable knowledge. By analyzing search terms related to flu
symptoms, Google could predict flu outbreaks earlier than traditional health systems. Such
insights are possible only through sophisticated data analysis that can aggregate and
interpret patterns across massive datasets. Data mining, therefore, helps to unlock this
potential, turning data into knowledge.
• Evolution of Database Technology: Data mining is a natural outgrowth of the steady evolution of database and information technology:
o 1960s: Early computing systems primarily handled simple file processing, storing data without sophisticated structures.
o Mid-1980s Onward: Database technology began supporting complex data types and
handling new forms of data like multimedia, spatial, and temporal data, expanding
the storage of information across domains.
o Late 1980s-1990s: The era of data warehousing and Online Analytical Processing
(OLAP) began, enabling businesses to store historical data for decision-making and
analysis. This led to the development of data warehouses—centralized repositories
that aggregate data from various sources.
• Data Tombs: The rapid accumulation of data has created a “data rich but information poor”
scenario. Vast amounts of stored data remain untouched, acting as “data tombs.” For
instance, companies may collect extensive customer transaction data but struggle to derive
meaningful insights without proper analysis tools.
• Automated Knowledge Discovery: Unlike traditional knowledge bases or expert systems that
rely on human experts to input knowledge, data mining automates the discovery of valuable
insights. It identifies meaningful patterns, trends, and correlations in data, which can then be
used to make informed business decisions.
o Business: Companies can use data mining for market basket analysis, customer
segmentation, and product recommendation.
o Healthcare: Data mining is used to analyze patient records and identify trends in
health data, like potential disease outbreaks or effective treatment paths.
o Finance: Financial institutions employ data mining for fraud detection, risk
assessment, and customer behavior analysis.
• From Data Tombs to Golden Nuggets: Data mining effectively turns “data tombs” into
“golden nuggets” of information. By extracting valuable knowledge from raw data,
organizations are better equipped to tackle challenges, improve efficiency, and gain a
competitive advantage.
Data mining is a process that involves discovering patterns and valuable knowledge from large
datasets. It is often described as the process of finding "nuggets" of useful information from massive
amounts of raw data, much like gold mining. Despite its name, the term "data mining" may not fully
capture the essence of the process, which is more accurately a form of "knowledge discovery from
data" (KDD). This broader concept emphasizes the extraction of meaningful patterns from data
rather than just processing the data itself.
Data mining is considered a key part of the larger knowledge discovery process, which involves
several stages. Here’s a detailed look at each stage:
1. Data Cleaning: This initial step involves removing noise and inconsistencies in the data.
Cleaning is crucial because data often comes from different sources and may contain errors,
missing values, or irrelevant information.
2. Data Integration: In this step, data from multiple sources are combined into a unified
dataset. This integration often occurs before analysis and may involve merging databases,
files, or streams into a cohesive data warehouse.
3. Data Selection: After integration, only relevant data is selected for the analysis. For instance,
if the goal is to understand customer buying behavior, only transaction data may be chosen,
leaving out irrelevant datasets.
4. Data Transformation: The selected data is transformed into a format that’s suitable for
mining. This transformation can involve summarizing or aggregating data, creating derived
variables, or normalizing values to fit the analysis methods.
5. Data Mining: This is the core step where intelligent techniques (like machine learning
algorithms) are applied to extract patterns or knowledge from the data. Techniques like
classification, clustering, and association rule mining are commonly used.
6. Pattern Evaluation: After mining, the extracted patterns are evaluated based on
"interestingness" criteria, which helps identify the most valuable insights for the end user.
Only the patterns that meet the required threshold for interestingness are retained.
7. Knowledge Presentation: The final step involves presenting the extracted knowledge to
users in a comprehensible way. Visualization techniques or summary reports are often used
here to make the insights actionable and interpretable.
In industry and academia, data mining is sometimes used interchangeably with the entire knowledge
discovery process, due to its concise terminology. However, data mining itself is just one part of this
broader process, as it refers specifically to the step of pattern extraction.
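To make the stages above concrete, here is a minimal, hypothetical sketch of a knowledge discovery pipeline in Python; the table layout, column names, and the choice of clustering as the mining step are assumptions for illustration, not prescribed by the KDD process itself.

```python
# Illustrative KDD pipeline: cleaning -> integration -> selection
# -> transformation -> mining -> evaluation. Column names are hypothetical.
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import KMeans

# 1-2. Data cleaning and integration: merge two sources, fill missing values.
sales = pd.DataFrame({"cust_id": [1, 2, 3, 4], "amount": [120.0, None, 80.0, 300.0]})
profiles = pd.DataFrame({"cust_id": [1, 2, 3, 4], "age": [34, 45, 29, 52]})
data = sales.merge(profiles, on="cust_id")                      # integration
data["amount"] = data["amount"].fillna(data["amount"].mean())   # cleaning

# 3-4. Data selection and transformation: keep relevant columns, normalize.
features = MinMaxScaler().fit_transform(data[["amount", "age"]])

# 5. Data mining: cluster customers into two segments.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features)

# 6-7. Pattern evaluation and presentation: summarize each segment.
data["segment"] = labels
print(data.groupby("segment")[["amount", "age"]].mean())
```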
Data can be mined from a variety of repositories, including:
• Databases: Structured databases, like relational databases, serve as common data sources.
• Data Warehouses: Integrated data warehouses that store historical and aggregated data are
often mined for trends and patterns.
• The Web: Web mining is used to extract information from web pages and online resources,
providing insights into web traffic, customer behavior, and trends.
• Other Repositories: Data mining can also be performed on text, multimedia, social networks,
and streaming data.
In Summary
Data mining is the process of extracting useful patterns and insights from vast datasets, making it a
powerful tool in transforming raw data into valuable knowledge. The structured stages of data
preprocessing, mining, and evaluation within the knowledge discovery process ensure that only the
most relevant and valuable insights are obtained, making data mining a cornerstone in data-driven
decision-making across industries.
Data mining can be applied to various types of data sources, each offering unique opportunities for
extracting patterns, trends, and insights. Here are some of the key types of data that can be mined:
1. Database Data
Database Management Systems (DBMS): A database is an organized collection of data managed by DBMS software, which supports efficient storage, retrieval, and management while enforcing data integrity, security, and concurrent access. Databases structured into tables, known as relational databases, are among the most common sources for data mining applications.
• Relational Databases: Consist of tables (relations) where each table has rows (records) and
columns (attributes). Each row represents a unique entity, identified by a key, with attributes
providing specific information about the entity. Example databases often include entities like
customer, item, employee, and branch, and their relationships, such as purchases and sales
transactions.
• Query and Data Manipulation: Users can query relational databases using languages like
SQL, which allows for operations like selection, projection, join, and aggregate functions
(e.g., sum, average, count). For instance, queries could retrieve all items sold in the last
quarter or analyze sales patterns.
Data Mining Applications: Through data mining, trends, associations, and predictions can be
identified, such as predicting customer credit risk based on attributes like income and previous credit
information, or finding sales deviations across different time periods.
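To illustrate the relational operations mentioned above (selection, projection, join, and aggregation), the following sketch runs SQL against a tiny in-memory SQLite database; the schema, table names, and data are invented for the example.

```python
# Illustrative relational queries (selection, projection, join, aggregation)
# against a small in-memory SQLite database. Schema and data are invented.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE customer(cust_id INTEGER PRIMARY KEY, name TEXT, income REAL);
    CREATE TABLE purchases(trans_id INTEGER, cust_id INTEGER, amount REAL);
    INSERT INTO customer VALUES (1, 'Alice', 52000), (2, 'Bob', 38000);
    INSERT INTO purchases VALUES (10, 1, 120.0), (11, 1, 80.0), (12, 2, 300.0);
""")

# Join customer and purchases, then aggregate total spending per customer.
rows = con.execute("""
    SELECT c.name, SUM(p.amount) AS total_spent, COUNT(*) AS n_transactions
    FROM customer AS c JOIN purchases AS p ON c.cust_id = p.cust_id
    WHERE p.amount > 50              -- selection
    GROUP BY c.name                  -- aggregation per customer
""").fetchall()
print(rows)  # e.g. [('Alice', 200.0, 2), ('Bob', 300.0, 1)]
```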
2. Data Warehouses
Definition and Purpose: A data warehouse is a centralized repository created by integrating data
from various sources, providing a unified view of data across an organization. This consolidated data
is preprocessed, transformed, and stored to support long-term analysis and decision-making.
• OLAP Operations: OLAP tools support operations like drill-down (viewing more detailed
data, such as monthly instead of quarterly) and roll-up (viewing more summarized data, like
sales per region instead of city). These operations allow users to analyze data from different
perspectives.
Data Mining Applications: In data warehouses, multidimensional data mining enables the discovery
of patterns across combinations of dimensions. This style of mining supports in-depth exploratory
analysis to identify patterns at various levels of detail, such as sales trends per region or seasonality
effects.
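As a rough analogy to the roll-up and drill-down operations described above, the sketch below aggregates an invented sales table with pandas at different levels of the location and time dimensions; it is not a real OLAP engine, only an illustration of the idea.

```python
# Roll-up / drill-down analogy with pandas: aggregate invented sales data
# at different levels of the location and time dimensions.
import pandas as pd

sales = pd.DataFrame({
    "region":  ["East", "East", "West", "West"],
    "city":    ["Boston", "New York", "Seattle", "Portland"],
    "quarter": ["Q1", "Q1", "Q1", "Q2"],
    "month":   ["Jan", "Feb", "Mar", "Apr"],
    "amount":  [100, 150, 120, 90],
})

# Roll-up: from city-level detail to region-level totals.
print(sales.groupby("region")["amount"].sum())

# Drill-down: from quarterly totals to monthly detail within each quarter.
print(sales.groupby(["quarter", "month"])["amount"].sum())
```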
3. Transactional Data
Definition: In transactional data, each record typically captures a transaction, such as a customer purchase or a booking, and includes a transaction identifier together with the list of items involved.
Data Mining Applications: Transactional data mining is particularly valuable for association analysis,
such as frequent itemset mining. This involves finding sets of items that are frequently purchased
together, enabling strategies like bundling products (e.g., offering discounts on printers when a
computer is purchased) to boost sales.
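A minimal sketch of frequent itemset counting over invented transactions follows; it uses brute-force pair counting rather than an optimized algorithm such as Apriori, and the item names and support threshold are assumptions.

```python
# Brute-force frequent pair counting over invented market-basket transactions.
from itertools import combinations
from collections import Counter

transactions = [
    {"computer", "printer", "paper"},
    {"computer", "printer"},
    {"computer", "mouse"},
    {"printer", "paper"},
]

min_support = 0.5  # a pair must appear in at least 50% of transactions
pair_counts = Counter()
for basket in transactions:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

n = len(transactions)
frequent_pairs = {p: c / n for p, c in pair_counts.items() if c / n >= min_support}
print(frequent_pairs)  # e.g. {('computer', 'printer'): 0.5, ('paper', 'printer'): 0.5}
```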
Beyond traditional databases and data warehouses, data mining can be applied to more complex
forms of data:
• Data Streams: Continuous flows of data, like sensor data or network traffic, which require
real-time processing.
• Ordered or Sequence Data: Such as time-series or log data, used for trend analysis and
sequence prediction.
• Graph or Networked Data: Data representing relationships, like social networks, which
enable analysis of connections, influence, and community structures.
• Spatial Data: Data with geographic components, like maps or location-based services, useful
in applications such as geospatial analysis.
• Text Data: Unstructured data from documents, emails, or web pages, suitable for text mining
and natural language processing.
• Multimedia Data: Includes images, audio, and video, often used in media content analysis.
• Web Data: Information from web pages and user interactions, mined for insights on
browsing behavior, user preferences, and trends.
Data mining functionalities specify the kinds of patterns that can be discovered. The main functionalities are described below.
Class/Concept Description: This involves summarizing the characteristics of data within a specific class or concept, for example, describing retail customers as high spenders or low spenders. Key techniques used here are:
• Data Characterization: Summarizes the general features of a target class, often with statistical summaries or data visualizations.
• Data Discrimination: Compares the general features of the target class with those of one or more contrasting classes, such as high spenders versus average spenders.
Mining Frequent Patterns, Associations, and Correlations: This functionality identifies recurring
relationships and patterns within the data. Techniques include:
• Frequent Itemset Mining: Identifies items that frequently appear together, like the common
purchase combination of bread and milk in retail.
• Association Rule Mining: Establishes rules, like “if X then Y,” that reveal item relationships
with statistical measures of support and confidence.
Classification and Regression for Predictive Analysis: These functionalities build models that predict categorical class labels or continuous values. Techniques include:
• Classification: Assigns data objects to predefined classes by learning a model, such as a decision tree or a set of rules, from labeled training data.
• Regression: Predicts a continuous value rather than a class label, often using linear regression or other numerical prediction methods.
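As a small illustration of numeric prediction by regression, the sketch below fits a least-squares line with NumPy; the income and spending figures are invented.

```python
# Fit a simple linear regression (least squares) on invented data with NumPy.
import numpy as np

income = np.array([30, 40, 50, 60, 70], dtype=float)    # predictor (in $1000s)
spending = np.array([12, 15, 21, 24, 30], dtype=float)  # continuous target

slope, intercept = np.polyfit(income, spending, deg=1)
predicted = slope * 55 + intercept  # predict spending for an income of 55
print(round(slope, 3), round(intercept, 3), round(predicted, 2))
```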
Cluster Analysis: Groups data objects based on their similarities, without needing predefined
labels. Clustering algorithms group items in a way that maximizes similarity within groups and
minimizes similarity between groups. This is often used in market segmentation.
Outlier Analysis: Identifies data points that deviate significantly from the majority of data,
potentially signaling errors, fraud, or unique cases.
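The following hedged sketch combines both ideas: it clusters invented 2-D points with scikit-learn's KMeans and then flags points that lie unusually far from their cluster centroid as potential outliers; the data and the distance cutoff are illustrative assumptions.

```python
# K-means clustering on invented 2-D data, then flag points far from their
# cluster centroid as potential outliers.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 1.1], [1.2, 0.9], [0.9, 1.0],   # one dense group
              [5.0, 5.2], [5.1, 4.9], [4.8, 5.0],   # another dense group
              [9.0, 0.5]])                           # isolated point

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
distances = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)

threshold = distances.mean() + 2 * distances.std()   # crude outlier cutoff
print("cluster labels:", km.labels_)
print("possible outliers:", np.where(distances > threshold)[0])
```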
Pattern Interestingness: This evaluates whether patterns discovered are useful and relevant.
Interesting patterns are those that are actionable, novel, or useful for decision-making.
Examples of these functionalities in practice:
1. Class/Concept Description:
o Example: A retail store wants to understand its customer base better. It uses data
characterization to describe the features of its high-spending customers, noting
characteristics like average age, income, preferred shopping times, and frequently
purchased items. Through data discrimination, it compares these high spenders to
average customers, highlighting distinct behaviors, like high spenders preferring
luxury items or shopping online more often.
2. Mining Frequent Patterns and Associations:
o Description: Identifies items or events that frequently occur together, which can reveal hidden relationships in data.
o Example: A supermarket analyzes shopping cart data and finds that "milk" and
"bread" are frequently purchased together. An association rule could be: “If a
customer buys milk, there is an 80% chance they will also buy bread.” This insight
could lead to strategic placement of these items to boost sales or create promotional
bundles.
4. Cluster Analysis:
o Description: Identifies clusters or groups of data that are similar to each other but
different from other clusters, which is helpful for segmentation.
5. Outlier Analysis:
o Description: Detects data points that significantly differ from others, which can
indicate fraud, errors, or unique occurrences.
o Example: A credit card company uses outlier analysis to flag unusual transactions,
such as a sudden large purchase in a foreign country by a customer who normally
only makes small local purchases. This can help prevent fraud by prompting a
security check or notifying the customer.
6. Evolution Analysis:
o Description: Studies patterns or trends in data over time, enabling the detection of
changes or prediction of future behaviors.
In data mining, not all patterns generated are necessarily useful or interesting, and identifying truly
interesting patterns is crucial. Here's an overview of what makes a pattern interesting and the
methods used to focus on valuable patterns:
• Valid on new/test data: Patterns should be reliable and maintain accuracy across different
datasets.
• Potentially useful: Patterns should offer insights that can guide decision-making.
• Hypothesis Confirmation: Patterns are also interesting if they validate a specific hypothesis a
user is trying to test.
Objective measures of interestingness are based on the structure of discovered patterns and the statistics underlying them:
• Support: For an association rule, this is the percentage of transactions in which all the items of the rule appear together. For example, if 30% of transactions include both "milk" and "bread," the support for the rule “milk → bread” is 30%.
• Confidence: Indicates the likelihood that a transaction containing one item (e.g., "milk") will
also contain another (e.g., "bread"). If 70% of transactions containing "milk" also contain
"bread," the confidence is 70%.
• Accuracy: For classification rules, accuracy measures the percentage of instances correctly
classified.
• Coverage: Similar to support, it measures the percentage of data instances to which a rule
applies.
• Complexity/Simplicity: Measured, for example, by a pattern’s bit length or the number of attributes it involves; simpler, more compact patterns are generally preferred.
Example: In association rule mining, support and confidence help filter out patterns that are
statistically irrelevant or likely due to chance. A threshold (e.g., minimum confidence of 50%) can
eliminate patterns with lower relevance.
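A minimal sketch of these calculations follows: it computes support and confidence for a candidate rule over invented transactions and keeps the rule only if both thresholds are met; item names and threshold values are assumptions.

```python
# Compute support and confidence for the rule "milk -> bread" over invented
# transactions, then keep the rule only if it meets minimum thresholds.
transactions = [
    {"milk", "bread", "eggs"},
    {"milk", "bread"},
    {"milk", "cereal"},
    {"bread", "butter"},
    {"milk", "bread", "butter"},
]

def support(itemset, transactions):
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs, transactions):
    return support(lhs | rhs, transactions) / support(lhs, transactions)

rule_support = support({"milk", "bread"}, transactions)          # 3/5 = 0.6
rule_confidence = confidence({"milk"}, {"bread"}, transactions)  # 0.6/0.8 = 0.75

min_support, min_confidence = 0.3, 0.5
if rule_support >= min_support and rule_confidence >= min_confidence:
    print(f"milk -> bread kept: support={rule_support}, confidence={rule_confidence}")
```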
Subjective measures of interestingness depend on the user’s existing knowledge and goals:
• Unexpectedness: Patterns that contradict user beliefs can reveal new insights.
• Actionability: Patterns that suggest strategic actions. For example, a pattern suggesting that
small earthquake clusters often precede a large quake can be actionable for disaster
preparedness.
This question of completeness addresses whether data mining algorithms can generate all relevant
patterns:
• Generating every possible pattern is typically impractical due to the sheer volume.
• User-defined constraints and interestingness thresholds can help focus on relevant patterns,
improving efficiency.
Generating only interesting patterns remains a challenge and an area of optimization in data mining.
By using interestingness measures, systems can prioritize the discovery and ranking of patterns likely
to be relevant, filtering out less useful ones.
These measures, combined with constraints, play a crucial role in refining the search space and
guiding the data mining process, which enhances both efficiency and relevance.
Data preprocessing is a critical step in data mining and knowledge discovery, as it helps enhance the
quality of data and thereby improves the effectiveness of the analysis. Here’s a breakdown of why
data preprocessing is essential and the tasks involved:
Data quality refers to how well data meets the requirements of its intended use, encompassing factors such as accuracy, completeness, consistency, timeliness, believability, and interpretability.
For example, as a sales manager at AllElectronics, you might notice missing values, inconsistent
department codes, or out-of-date customer addresses, each affecting data quality. While a marketing
analyst may overlook some address inaccuracies, a sales report manager might consider even minor
inconsistencies significant.
Data Cleaning
Data cleaning addresses issues like missing, noisy, or inconsistent data. Techniques include:
• Filling Missing Values: Using methods like mean imputation or predictive modeling.
• Smoothing Noisy Data: Techniques such as binning, regression, or clustering can handle
errors or deviations in data.
• Consistency Checking: Resolving discrepancies, such as different codes for the same item
across databases.
Example: If sales data has missing “promotion status” for certain items, filling those values ensures
accurate analysis.
Data Integration
Data integration combines data from multiple sources, often necessary when using data warehouses.
Challenges include:
• Schema Matching: Aligning attributes with different names or formats across databases
(e.g., "cust_id" vs. "customer_id").
Example: Combining customer information from multiple branches of AllElectronics requires matching equivalent records and resolving value inconsistencies, such as the same customer appearing as “William” in one branch and “Bill” in another.
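To illustrate schema matching in code, the hedged pandas sketch below aligns differently named key columns from two invented branch tables before merging them; the column names and records are assumptions.

```python
# Align differently named key columns ("cust_id" vs. "customer_id") from two
# invented branch tables, then merge them into one integrated table.
import pandas as pd

branch_a = pd.DataFrame({"cust_id": [1, 2], "name": ["Bill", "Ann"], "city": ["Boston", "Chicago"]})
branch_b = pd.DataFrame({"customer_id": [1, 3], "name": ["William", "Rosa"], "phone": ["555-0101", "555-0102"]})

# Schema matching: rename so both sources use the same key attribute.
branch_b = branch_b.rename(columns={"customer_id": "cust_id"})

# Integrate; an outer join keeps customers that appear in only one source.
integrated = branch_a.merge(branch_b, on="cust_id", how="outer", suffixes=("_a", "_b"))
print(integrated)
```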
Data Reduction
Data reduction decreases data size while preserving essential information, making analysis faster and
more efficient. Methods include:
• Dimensionality Reduction: Techniques like Principal Component Analysis (PCA) to reduce the
number of features.
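A hedged sketch of dimensionality reduction with scikit-learn's PCA on invented numeric data:

```python
# Reduce four invented numeric attributes to two principal components.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))                               # 100 records, 4 features
X[:, 3] = X[:, 0] * 0.9 + rng.normal(scale=0.1, size=100)   # a redundant attribute

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                    # (100, 2)
print(pca.explained_variance_ratio_)      # share of variance kept by each component
```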
Data Transformation
Data transformation converts data into forms that are better suited for mining. Techniques include:
• Normalization: Scaling data to a specified range, like [0, 1], to make attributes comparable.
• Discretization: Converting continuous data into discrete intervals or categories (e.g., age
ranges into “youth,” “adult,” and “senior”).
• Concept Hierarchy Generation: Transforming detailed data into higher-level concepts for
abstract analysis.
Example: Scaling “age” and “annual salary” attributes to the same range to improve distance-based
analysis accuracy.
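A small sketch of min-max normalization and discretization into labeled age ranges, using pandas on invented values:

```python
# Min-max normalization of "annual_salary" and discretization of "age"
# into labeled ranges, on invented values.
import pandas as pd

df = pd.DataFrame({"age": [19, 34, 47, 68], "annual_salary": [18000, 52000, 71000, 39000]})

# Normalization: scale salary to [0, 1] so it is comparable with other attributes.
s = df["annual_salary"]
df["salary_scaled"] = (s - s.min()) / (s.max() - s.min())

# Discretization: map continuous ages onto concept-level categories.
df["age_group"] = pd.cut(df["age"], bins=[0, 25, 60, 120], labels=["youth", "adult", "senior"])
print(df)
```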
Summary
Real-world data are often incomplete, inconsistent, and contain errors, making preprocessing
essential for high-quality analysis. By ensuring data quality through preprocessing tasks,
organizations can derive more accurate, efficient, and reliable insights for decision-making.
Data cleaning, or data cleansing, is a critical step in data preprocessing that addresses incomplete,
noisy, or inconsistent data. Here’s an overview of its main concepts:
Missing data values can hinder analysis, and there are several methods for managing them:
1. Ignoring the Tuple: Useful when many attributes are missing, though this discards potentially useful information.
2. Filling in the Value Manually: Feasible only for small amounts of missing data, as it is time-consuming and does not scale to large datasets.
3. Global Constant Replacement: Missing values can be filled with a constant like "Unknown,"
though it may skew analysis by introducing artificial patterns.
4. Using Central Tendencies (Mean or Median): Based on data distribution, mean or median
values can fill gaps, assuming central tendencies are suitable replacements.
5. Class-Specific Mean/Median: For attributes within a particular class, the class mean or
median may be a more accurate fill.
6. Most Probable Value Prediction: Techniques like regression or decision trees can predict
missing values based on existing data patterns.
Some methods may introduce biases, but using related attributes often yields better estimates,
preserving attribute relationships.
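The hedged pandas sketch below illustrates three of the strategies above (global constant, overall mean, and class-specific mean) on an invented customer table; column names and values are assumptions.

```python
# Filling missing values: global constant, overall mean, and class-specific mean,
# illustrated on an invented customer table.
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "segment": ["high", "high", "low", "low"],
    "income":  [90000, np.nan, 30000, np.nan],
    "city":    ["Boston", None, "Austin", "Denver"],
})

df["city"] = df["city"].fillna("Unknown")                        # global constant
df["income_mean"] = df["income"].fillna(df["income"].mean())     # overall mean
df["income_by_class"] = df["income"].fillna(                     # class-specific mean
    df.groupby("segment")["income"].transform("mean"))
print(df)
```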
Noise refers to random errors or variances in data. To manage this, several smoothing techniques are
applied:
1. Binning: Divides sorted data into bins. Smoothing can be done by bin means, medians, or
boundaries, which smooth data based on neighboring values.
2. Regression: Fits data to a function, like a line (linear regression) or a surface (multiple
regression), based on relationships among attributes.
3. Outlier Analysis: Clustering can identify outliers, which may represent noise.
These methods also reduce distinct attribute values, aiding in data discretization.
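A minimal sketch of smoothing by bin means on a small invented price list, using equal-frequency bins:

```python
# Smoothing by bin means: partition sorted invented prices into equal-frequency
# bins and replace each value with its bin's mean.
prices = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])
bin_size = 3

smoothed = []
for i in range(0, len(prices), bin_size):
    bin_values = prices[i:i + bin_size]
    bin_mean = sum(bin_values) / len(bin_values)
    smoothed.extend([round(bin_mean, 1)] * len(bin_values))

print(smoothed)  # [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
```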
• Data Cleaning Tools: Commercial tools support the process: data scrubbing tools use simple domain knowledge to detect and correct errors, data auditing tools analyze the data to discover rule violations, and data migration/ETL tools allow transformations to be specified and applied.
Increasingly, data cleaning tools emphasize interactivity. For example, Potter’s Wheel provides a step-
by-step interface, allowing real-time transformation checks. Declarative languages for data
transformation are also emerging, enhancing interactivity and efficiency in data cleaning processes.
These strategies ensure data reliability, though the process can be iterative and time-consuming,
requiring updates to metadata to ease future data cleaning.
Data Integration
Data mining often requires data integration, which is the process of merging data from multiple
sources. Effective integration helps reduce redundancies and inconsistencies in the resulting dataset,
thereby improving the accuracy and efficiency of subsequent data mining tasks.
• Schema Integration: Aligning the data structures and formats from different sources.
Metadata plays a crucial role in schema integration and data cleaning, helping to identify and
transform attributes accurately. For instance, different coding schemes for payment types may need
to be standardized for effective integration.
During data integration, it is essential to accurately match equivalent entities from various data
sources, such as databases, data cubes, or flat files. This process involves careful consideration of:
• Attribute Metadata: Attributes may have different names, meanings, data types, and
permissible value ranges. Understanding this metadata helps avoid errors during integration.
• Functional Dependencies and Referential Constraints: Ensuring that these constraints are
preserved across different systems is critical. For example, the way discounts are applied (to
the order versus individual line items) must be consistent to avoid errors in the target
system.
Redundancies may arise when one attribute can be derived from others. To detect redundancy,
correlation analysis can be employed:
• Nominal Data: For nominal attributes, a chi-square test can determine the correlation
between two attributes. This test assesses the independence of two categorical variables by
comparing observed frequencies against expected frequencies derived from their
distribution.
• Numeric Data: For numeric attributes, the correlation coefficient (Pearson’s r) measures the
strength of the relationship between two variables. A high positive or negative value
indicates a strong correlation, suggesting potential redundancy.
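A hedged sketch of both redundancy checks with SciPy follows; the contingency counts and numeric values are invented for illustration.

```python
# Redundancy checks: chi-square test for two nominal attributes and Pearson's r
# for two numeric attributes, both on invented data.
import numpy as np
from scipy.stats import chi2_contingency, pearsonr

# Nominal: observed counts for two categorical attributes (invented numbers).
observed = np.array([[250, 200],
                     [ 50, 1000]])
chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"chi2={chi2:.1f}, p={p_value:.3g}")  # a small p suggests the attributes are correlated

# Numeric: yearly sales vs. advertising spend (invented values).
sales = np.array([100, 120, 150, 170, 200], dtype=float)
ads   = np.array([10, 13, 15, 18, 21], dtype=float)
r, p = pearsonr(sales, ads)
print(f"Pearson r={r:.2f}")                  # r close to 1 suggests possible redundancy
```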
Tuple duplication is another critical issue in data integration. It occurs when two or more identical (or nearly identical) records exist for the same real-world entry. Common causes include:
• Denormalized Tables: Used to enhance performance, they may lead to redundancies if not
managed properly.
• Data Entry Errors: Inconsistent updates can leave duplicate records that disagree with each other, such as different addresses recorded for the same purchaser in a database.
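A small pandas sketch for removing exact duplicate tuples and surfacing conflicting near-duplicates (same purchaser, different addresses), on invented records:

```python
# Detect exact duplicate tuples and inspect conflicting near-duplicates
# (same purchaser, different address) in an invented table.
import pandas as pd

orders = pd.DataFrame({
    "purchaser": ["M. Singh", "M. Singh", "M. Singh", "J. Ortiz"],
    "address":   ["12 Elm St", "12 Elm St", "98 Oak Ave", "5 Pine Rd"],
    "amount":    [40.0, 40.0, 40.0, 75.0],
})

deduplicated = orders.drop_duplicates()            # removes the exact duplicate row
conflicts = deduplicated[deduplicated.duplicated("purchaser", keep=False)]
print(conflicts)   # same purchaser listed with two different addresses
```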
Data integration must also address conflicts in data values, which can occur due to differences in
representation or encoding between data sources. Common examples include:
• Unit Differences: A weight attribute may be recorded in metric units in one source and
imperial units in another.
• Grading Systems: Different educational institutions may use varying grading schemes,
complicating data exchanges.
• Abstraction Levels: Attributes recorded at different levels of detail can lead to confusion,
such as total sales figures representing different scopes across systems.
By employing effective detection and resolution strategies, data integration can be streamlined,
leading to more reliable datasets for analysis.
Conclusion
Effective data integration is crucial for successful data mining. It requires careful consideration of
schema matching, entity identification, redundancy analysis, and conflict resolution to create a
coherent and accurate dataset from multiple sources.