Data Mining (DM)
• Data Overload: Every day, countless terabytes and even petabytes of data are generated
from various sources, such as businesses, social networks, scientific experiments, healthcare
records, and global telecommunications. This data explosion, a direct result of widespread
computerization and advances in data storage technologies, makes it challenging to analyze
data effectively using traditional methods.
• Extracting Insights: With so much data available, valuable insights often remain hidden
within it. For instance, retail stores, large corporations, scientific research facilities, and even
government agencies generate huge data volumes. Manually analyzing this data to gain
insights, such as customer preferences or emerging trends, is impractical. Data mining meets
this need by automating the process of discovering hidden patterns, correlations, and useful
information from large datasets, transforming raw data into actionable knowledge.
• Information Age Misconception: People often say we live in an "Information Age," but this is
misleading. We actually live in a "Data Age," where information still needs to be extracted
from raw data to be useful. Despite access to vast amounts of data, most of it remains
unanalyzed and unused.
• Data vs. Knowledge: Google’s Flu Trends is a classic example demonstrating how data mining
can convert raw data into valuable knowledge. By analyzing search terms related to flu
symptoms, Google could predict flu outbreaks earlier than traditional health systems. Such
insights are possible only through sophisticated data analysis that can aggregate and
interpret patterns across massive datasets. Data mining, therefore, helps to unlock this
potential, turning data into knowledge.
• Evolution of Database Technology: Data mining is a natural outgrowth of the steady evolution of database and information technology:
o 1960s: Early computing systems primarily handled simple file processing, storing data without sophisticated structures.
o Mid-1980s Onward: Database technology began supporting complex data types and
handling new forms of data like multimedia, spatial, and temporal data, expanding
the storage of information across domains.
o Late 1980s-1990s: The era of data warehousing and Online Analytical Processing
(OLAP) began, enabling businesses to store historical data for decision-making and
analysis. This led to the development of data warehouses—centralized repositories
that aggregate data from various sources.
• Data Tombs: The rapid accumulation of data has created a “data rich but information poor”
scenario. Vast amounts of stored data remain untouched, acting as “data tombs.” For
instance, companies may collect extensive customer transaction data but struggle to derive
meaningful insights without proper analysis tools.
• Automated Knowledge Discovery: Unlike traditional knowledge bases or expert systems that
rely on human experts to input knowledge, data mining automates the discovery of valuable
insights. It identifies meaningful patterns, trends, and correlations in data, which can then be
used to make informed business decisions.
o Business: Companies can use data mining for market basket analysis, customer
segmentation, and product recommendation.
o Healthcare: Data mining is used to analyze patient records and identify trends in
health data, like potential disease outbreaks or effective treatment paths.
o Finance: Financial institutions employ data mining for fraud detection, risk
assessment, and customer behavior analysis.
• From Data Tombs to Golden Nuggets: Data mining effectively turns “data tombs” into
“golden nuggets” of information. By extracting valuable knowledge from raw data,
organizations are better equipped to tackle challenges, improve efficiency, and gain a
competitive advantage.
Data mining is a process that involves discovering patterns and valuable knowledge from large
datasets. It is often described as the process of finding "nuggets" of useful information from massive
amounts of raw data, much like gold mining. Despite its name, the term "data mining" may not fully
capture the essence of the process, which is more accurately a form of "knowledge discovery from
data" (KDD). This broader concept emphasizes the extraction of meaningful patterns from data
rather than just processing the data itself.
Data mining is considered a key part of the larger knowledge discovery process, which involves
several stages. Here’s a detailed look at each stage:
1. Data Cleaning: This initial step involves removing noise and inconsistencies in the data.
Cleaning is crucial because data often comes from different sources and may contain errors,
missing values, or irrelevant information.
2. Data Integration: In this step, data from multiple sources are combined into a unified
dataset. This integration often occurs before analysis and may involve merging databases,
files, or streams into a cohesive data warehouse.
3. Data Selection: After integration, only relevant data is selected for the analysis. For instance,
if the goal is to understand customer buying behavior, only transaction data may be chosen,
leaving out irrelevant datasets.
4. Data Transformation: The selected data is transformed into a format that’s suitable for
mining. This transformation can involve summarizing or aggregating data, creating derived
variables, or normalizing values to fit the analysis methods.
5. Data Mining: This is the core step where intelligent techniques (like machine learning
algorithms) are applied to extract patterns or knowledge from the data. Techniques like
classification, clustering, and association rule mining are commonly used.
6. Pattern Evaluation: After mining, the extracted patterns are evaluated based on
"interestingness" criteria, which helps identify the most valuable insights for the end user.
Only the patterns that meet the required threshold for interestingness are retained.
7. Knowledge Presentation: The final step involves presenting the extracted knowledge to
users in a comprehensible way. Visualization techniques or summary reports are often used
here to make the insights actionable and interpretable.
In industry and academia, data mining is sometimes used interchangeably with the entire knowledge
discovery process, due to its concise terminology. However, data mining itself is just one part of this
broader process, as it refers specifically to the step of pattern extraction.
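To make the stages above concrete, here is a minimal, hypothetical sketch of a knowledge discovery pipeline in Python; the table layout, column names, and the choice of clustering as the mining step are assumptions for illustration, not prescribed by the KDD process itself.

```python
# Illustrative KDD pipeline: cleaning -> integration -> selection
# -> transformation -> mining -> evaluation. Column names are hypothetical.
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import KMeans

# 1-2. Data cleaning and integration: merge two sources, fill missing values.
sales = pd.DataFrame({"cust_id": [1, 2, 3, 4], "amount": [120.0, None, 80.0, 300.0]})
profiles = pd.DataFrame({"cust_id": [1, 2, 3, 4], "age": [34, 45, 29, 52]})
data = sales.merge(profiles, on="cust_id")                      # integration
data["amount"] = data["amount"].fillna(data["amount"].mean())   # cleaning

# 3-4. Data selection and transformation: keep relevant columns, normalize.
features = MinMaxScaler().fit_transform(data[["amount", "age"]])

# 5. Data mining: cluster customers into two segments.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features)

# 6-7. Pattern evaluation and presentation: summarize each segment.
data["segment"] = labels
print(data.groupby("segment")[["amount", "age"]].mean())
```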
Data can be mined from a variety of repositories, including:
• Databases: Structured databases, like relational databases, serve as common data sources.
• Data Warehouses: Integrated data warehouses that store historical and aggregated data are
often mined for trends and patterns.
• The Web: Web mining is used to extract information from web pages and online resources,
providing insights into web traffic, customer behavior, and trends.
• Other Repositories: Data mining can also be performed on text, multimedia, social networks,
and streaming data.
In Summary
Data mining is the process of extracting useful patterns and insights from vast datasets, making it a
powerful tool in transforming raw data into valuable knowledge. The structured stages of data
preprocessing, mining, and evaluation within the knowledge discovery process ensure that only the
most relevant and valuable insights are obtained, making data mining a cornerstone in data-driven
decision-making across industries.
Data mining can be applied to various types of data sources, each offering unique opportunities for
extracting patterns, trends, and insights. Here are some of the key types of data that can be mined:
1. Database Data
Database Management Systems (DBMS): A database is an organized collection of data managed by DBMS software, which supports efficient storage, retrieval, and management while enforcing data integrity, security, and concurrent access. Databases structured into tables, known as relational databases, are among the most common sources for data mining applications.
• Relational Databases: Consist of tables (relations) where each table has rows (records) and
columns (attributes). Each row represents a unique entity, identified by a key, with attributes
providing specific information about the entity. Example databases often include entities like
customer, item, employee, and branch, and their relationships, such as purchases and sales
transactions.
• Query and Data Manipulation: Users can query relational databases using languages like
SQL, which allows for operations like selection, projection, join, and aggregate functions
(e.g., sum, average, count). For instance, queries could retrieve all items sold in the last
quarter or analyze sales patterns.
Data Mining Applications: Through data mining, trends, associations, and predictions can be
identified, such as predicting customer credit risk based on attributes like income and previous credit
information, or finding sales deviations across different time periods.
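To illustrate the relational operations mentioned above (selection, projection, join, and aggregation), the following sketch runs SQL against a tiny in-memory SQLite database; the schema, table names, and data are invented for the example.

```python
# Illustrative relational queries (selection, projection, join, aggregation)
# against a small in-memory SQLite database. Schema and data are invented.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE customer(cust_id INTEGER PRIMARY KEY, name TEXT, income REAL);
    CREATE TABLE purchases(trans_id INTEGER, cust_id INTEGER, amount REAL);
    INSERT INTO customer VALUES (1, 'Alice', 52000), (2, 'Bob', 38000);
    INSERT INTO purchases VALUES (10, 1, 120.0), (11, 1, 80.0), (12, 2, 300.0);
""")

# Join customer and purchases, then aggregate total spending per customer.
rows = con.execute("""
    SELECT c.name, SUM(p.amount) AS total_spent, COUNT(*) AS n_transactions
    FROM customer AS c JOIN purchases AS p ON c.cust_id = p.cust_id
    WHERE p.amount > 50              -- selection
    GROUP BY c.name                  -- aggregation per customer
""").fetchall()
print(rows)  # e.g. [('Alice', 200.0, 2), ('Bob', 300.0, 1)]
```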
2. Data Warehouses
Definition and Purpose: A data warehouse is a centralized repository created by integrating data
from various sources, providing a unified view of data across an organization. This consolidated data
is preprocessed, transformed, and stored to support long-term analysis and decision-making.
• OLAP Operations: OLAP tools support operations like drill-down (viewing more detailed
data, such as monthly instead of quarterly) and roll-up (viewing more summarized data, like
sales per region instead of city). These operations allow users to analyze data from different
perspectives.
Data Mining Applications: In data warehouses, multidimensional data mining enables the discovery
of patterns across combinations of dimensions. This style of mining supports in-depth exploratory
analysis to identify patterns at various levels of detail, such as sales trends per region or seasonality
effects.
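As a rough analogy to the roll-up and drill-down operations described above, the sketch below aggregates an invented sales table with pandas at different levels of the location and time dimensions; it is not a real OLAP engine, only an illustration of the idea.

```python
# Roll-up / drill-down analogy with pandas: aggregate invented sales data
# at different levels of the location and time dimensions.
import pandas as pd

sales = pd.DataFrame({
    "region":  ["East", "East", "West", "West"],
    "city":    ["Boston", "New York", "Seattle", "Portland"],
    "quarter": ["Q1", "Q1", "Q1", "Q2"],
    "month":   ["Jan", "Feb", "Mar", "Apr"],
    "amount":  [100, 150, 120, 90],
})

# Roll-up: from city-level detail to region-level totals.
print(sales.groupby("region")["amount"].sum())

# Drill-down: from quarterly totals to monthly detail within each quarter.
print(sales.groupby(["quarter", "month"])["amount"].sum())
```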
3. Transactional Data
Definition: In transactional data, each record typically captures a transaction, such as a customer purchase or a booking, and includes a transaction identifier together with the list of items involved.
Data Mining Applications: Transactional data mining is particularly valuable for association analysis,
such as frequent itemset mining. This involves finding sets of items that are frequently purchased
together, enabling strategies like bundling products (e.g., offering discounts on printers when a
computer is purchased) to boost sales.
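A minimal sketch of frequent itemset counting over invented transactions follows; it uses brute-force pair counting rather than an optimized algorithm such as Apriori, and the item names and support threshold are assumptions.

```python
# Brute-force frequent pair counting over invented market-basket transactions.
from itertools import combinations
from collections import Counter

transactions = [
    {"computer", "printer", "paper"},
    {"computer", "printer"},
    {"computer", "mouse"},
    {"printer", "paper"},
]

min_support = 0.5  # a pair must appear in at least 50% of transactions
pair_counts = Counter()
for basket in transactions:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

n = len(transactions)
frequent_pairs = {p: c / n for p, c in pair_counts.items() if c / n >= min_support}
print(frequent_pairs)  # e.g. {('computer', 'printer'): 0.5, ('paper', 'printer'): 0.5}
```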
Beyond traditional databases and data warehouses, data mining can be applied to more complex
forms of data:
• Data Streams: Continuous flows of data, like sensor data or network traffic, which require
real-time processing.
• Ordered or Sequence Data: Such as time-series or log data, used for trend analysis and
sequence prediction.
• Graph or Networked Data: Data representing relationships, like social networks, which
enable analysis of connections, influence, and community structures.
• Spatial Data: Data with geographic components, like maps or location-based services, useful
in applications such as geospatial analysis.
• Text Data: Unstructured data from documents, emails, or web pages, suitable for text mining
and natural language processing.
• Multimedia Data: Includes images, audio, and video, often used in media content analysis.
• Web Data: Information from web pages and user interactions, mined for insights on
browsing behavior, user preferences, and trends.
Data mining functionalities specify the kinds of patterns that can be discovered. The main functionalities are described below.
Class/Concept Description: This involves summarizing the characteristics of data within a specific class or concept, for example, describing retail customers as high spenders or low spenders. Key techniques used here are:
• Data Characterization: Summarizes the general features of a target class, often with statistical summaries or data visualizations.
• Data Discrimination: Compares the general features of the target class with those of one or more contrasting classes, such as high spenders versus average spenders.
Mining Frequent Patterns, Associations, and Correlations: This functionality identifies recurring
relationships and patterns within the data. Techniques include:
• Frequent Itemset Mining: Identifies items that frequently appear together, like the common
purchase combination of bread and milk in retail.
• Association Rule Mining: Establishes rules, like “if X then Y,” that reveal item relationships
with statistical measures of support and confidence.
Classification and Regression for Predictive Analysis: These functionalities build models that predict categorical class labels or continuous values. Techniques include:
• Classification: Assigns data objects to predefined classes by learning a model, such as a decision tree or a set of rules, from labeled training data.
• Regression: Predicts a continuous value rather than a class label, often using linear regression or other numerical prediction methods.
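As a small illustration of numeric prediction by regression, the sketch below fits a least-squares line with NumPy; the income and spending figures are invented.

```python
# Fit a simple linear regression (least squares) on invented data with NumPy.
import numpy as np

income = np.array([30, 40, 50, 60, 70], dtype=float)    # predictor (in $1000s)
spending = np.array([12, 15, 21, 24, 30], dtype=float)  # continuous target

slope, intercept = np.polyfit(income, spending, deg=1)
predicted = slope * 55 + intercept  # predict spending for an income of 55
print(round(slope, 3), round(intercept, 3), round(predicted, 2))
```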
Cluster Analysis: Groups data objects based on their similarities, without needing predefined
labels. Clustering algorithms group items in a way that maximizes similarity within groups and
minimizes similarity between groups. This is often used in market segmentation.
Outlier Analysis: Identifies data points that deviate significantly from the majority of data,
potentially signaling errors, fraud, or unique cases.
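The following hedged sketch combines both ideas: it clusters invented 2-D points with scikit-learn's KMeans and then flags points that lie unusually far from their cluster centroid as potential outliers; the data and the distance cutoff are illustrative assumptions.

```python
# K-means clustering on invented 2-D data, then flag points far from their
# cluster centroid as potential outliers.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 1.1], [1.2, 0.9], [0.9, 1.0],   # one dense group
              [5.0, 5.2], [5.1, 4.9], [4.8, 5.0],   # another dense group
              [9.0, 0.5]])                           # isolated point

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
distances = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)

threshold = distances.mean() + 2 * distances.std()   # crude outlier cutoff
print("cluster labels:", km.labels_)
print("possible outliers:", np.where(distances > threshold)[0])
```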
Pattern Interestingness: This evaluates whether patterns discovered are useful and relevant.
Interesting patterns are those that are actionable, novel, or useful for decision-making.
Examples of these functionalities in practice:
1. Class/Concept Description:
o Example: A retail store wants to understand its customer base better. It uses data
characterization to describe the features of its high-spending customers, noting
characteristics like average age, income, preferred shopping times, and frequently
purchased items. Through data discrimination, it compares these high spenders to
average customers, highlighting distinct behaviors, like high spenders preferring
luxury items or shopping online more often.
2. Mining Frequent Patterns and Associations:
o Description: Identifies items or events that frequently occur together, which can reveal hidden relationships in data.
o Example: A supermarket analyzes shopping cart data and finds that "milk" and
"bread" are frequently purchased together. An association rule could be: “If a
customer buys milk, there is an 80% chance they will also buy bread.” This insight
could lead to strategic placement of these items to boost sales or create promotional
bundles.
4. Cluster Analysis:
o Description: Identifies clusters or groups of data that are similar to each other but
different from other clusters, which is helpful for segmentation.
5. Outlier Analysis:
o Description: Detects data points that significantly differ from others, which can
indicate fraud, errors, or unique occurrences.
o Example: A credit card company uses outlier analysis to flag unusual transactions,
such as a sudden large purchase in a foreign country by a customer who normally
only makes small local purchases. This can help prevent fraud by prompting a
security check or notifying the customer.
6. Evolution Analysis:
o Description: Studies patterns or trends in data over time, enabling the detection of
changes or prediction of future behaviors.
In data mining, not all patterns generated are necessarily useful or interesting, and identifying truly
interesting patterns is crucial. Here's an overview of what makes a pattern interesting and the
methods used to focus on valuable patterns:
• Valid on new/test data: Patterns should be reliable and maintain accuracy across different
datasets.
• Potentially useful: Patterns should offer insights that can guide decision-making.
• Hypothesis Confirmation: Patterns are also interesting if they validate a specific hypothesis a
user is trying to test.
Objective measures of interestingness are based on the structure of discovered patterns and the statistics underlying them:
• Support: For an association rule, this is the percentage of transactions in which all the items of the rule appear together. For example, if 30% of transactions include both "milk" and "bread," the support for the rule “milk → bread” is 30%.
• Confidence: Indicates the likelihood that a transaction containing one item (e.g., "milk") will
also contain another (e.g., "bread"). If 70% of transactions containing "milk" also contain
"bread," the confidence is 70%.
• Accuracy: For classification rules, accuracy measures the percentage of instances correctly
classified.
• Coverage: Similar to support, it measures the percentage of data instances to which a rule
applies.
• Complexity/Simplicity: Measured, for example, by a pattern’s bit length or the number of attributes it involves; simpler, more compact patterns are generally preferred.
Example: In association rule mining, support and confidence help filter out patterns that are
statistically irrelevant or likely due to chance. A threshold (e.g., minimum confidence of 50%) can
eliminate patterns with lower relevance.
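A minimal sketch of these calculations follows: it computes support and confidence for a candidate rule over invented transactions and keeps the rule only if both thresholds are met; item names and threshold values are assumptions.

```python
# Compute support and confidence for the rule "milk -> bread" over invented
# transactions, then keep the rule only if it meets minimum thresholds.
transactions = [
    {"milk", "bread", "eggs"},
    {"milk", "bread"},
    {"milk", "cereal"},
    {"bread", "butter"},
    {"milk", "bread", "butter"},
]

def support(itemset, transactions):
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs, transactions):
    return support(lhs | rhs, transactions) / support(lhs, transactions)

rule_support = support({"milk", "bread"}, transactions)          # 3/5 = 0.6
rule_confidence = confidence({"milk"}, {"bread"}, transactions)  # 0.6/0.8 = 0.75

min_support, min_confidence = 0.3, 0.5
if rule_support >= min_support and rule_confidence >= min_confidence:
    print(f"milk -> bread kept: support={rule_support}, confidence={rule_confidence}")
```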
Subjective measures of interestingness depend on the user’s existing knowledge and goals:
• Unexpectedness: Patterns that contradict user beliefs can reveal new insights.
• Actionability: Patterns that suggest strategic actions. For example, a pattern suggesting that
small earthquake clusters often precede a large quake can be actionable for disaster
preparedness.
This question of completeness addresses whether data mining algorithms can generate all relevant
patterns:
• Generating every possible pattern is typically impractical due to the sheer volume.
• User-defined constraints and interestingness thresholds can help focus on relevant patterns,
improving efficiency.
Generating only interesting patterns remains a challenge and an area of optimization in data mining.
By using interestingness measures, systems can prioritize the discovery and ranking of patterns likely
to be relevant, filtering out less useful ones.
These measures, combined with constraints, play a crucial role in refining the search space and
guiding the data mining process, which enhances both efficiency and relevance.
Data preprocessing is a critical step in data mining and knowledge discovery, as it helps enhance the
quality of data and thereby improves the effectiveness of the analysis. Here’s a breakdown of why
data preprocessing is essential and the tasks involved:
Data quality refers to how well data meets the requirements of its intended use, encompassing factors such as accuracy, completeness, consistency, timeliness, believability, and interpretability.
For example, as a sales manager at AllElectronics, you might notice missing values, inconsistent
department codes, or out-of-date customer addresses, each affecting data quality. While a marketing
analyst may overlook some address inaccuracies, a sales report manager might consider even minor
inconsistencies significant.
Data Cleaning
Data cleaning addresses issues like missing, noisy, or inconsistent data. Techniques include:
• Filling Missing Values: Using methods like mean imputation or predictive modeling.
• Smoothing Noisy Data: Techniques such as binning, regression, or clustering can handle
errors or deviations in data.
• Consistency Checking: Resolving discrepancies, such as different codes for the same item
across databases.
Example: If sales data has missing “promotion status” for certain items, filling those values ensures
accurate analysis.
Data Integration
Data integration combines data from multiple sources, often necessary when using data warehouses.
Challenges include:
• Schema Matching: Aligning attributes with different names or formats across databases
(e.g., "cust_id" vs. "customer_id").
Example: Combining customer information from multiple branches of AllElectronics requires matching equivalent records and resolving value inconsistencies, such as the same customer appearing as “William” in one branch and “Bill” in another.
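To illustrate schema matching in code, the hedged pandas sketch below aligns differently named key columns from two invented branch tables before merging them; the column names and records are assumptions.

```python
# Align differently named key columns ("cust_id" vs. "customer_id") from two
# invented branch tables, then merge them into one integrated table.
import pandas as pd

branch_a = pd.DataFrame({"cust_id": [1, 2], "name": ["Bill", "Ann"], "city": ["Boston", "Chicago"]})
branch_b = pd.DataFrame({"customer_id": [1, 3], "name": ["William", "Rosa"], "phone": ["555-0101", "555-0102"]})

# Schema matching: rename so both sources use the same key attribute.
branch_b = branch_b.rename(columns={"customer_id": "cust_id"})

# Integrate; an outer join keeps customers that appear in only one source.
integrated = branch_a.merge(branch_b, on="cust_id", how="outer", suffixes=("_a", "_b"))
print(integrated)
```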
Data Reduction
Data reduction decreases data size while preserving essential information, making analysis faster and
more efficient. Methods include:
• Dimensionality Reduction: Techniques like Principal Component Analysis (PCA) to reduce the
number of features.
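A hedged sketch of dimensionality reduction with scikit-learn's PCA on invented numeric data:

```python
# Reduce four invented numeric attributes to two principal components.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))                               # 100 records, 4 features
X[:, 3] = X[:, 0] * 0.9 + rng.normal(scale=0.1, size=100)   # a redundant attribute

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                    # (100, 2)
print(pca.explained_variance_ratio_)      # share of variance kept by each component
```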
Data Transformation
Data transformation converts data into forms that are better suited for mining. Techniques include:
• Normalization: Scaling data to a specified range, like [0, 1], to make attributes comparable.
• Discretization: Converting continuous data into discrete intervals or categories (e.g., age
ranges into “youth,” “adult,” and “senior”).
• Concept Hierarchy Generation: Transforming detailed data into higher-level concepts for
abstract analysis.
Example: Scaling “age” and “annual salary” attributes to the same range to improve distance-based
analysis accuracy.
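A small sketch of min-max normalization and discretization into labeled age ranges, using pandas on invented values:

```python
# Min-max normalization of "annual_salary" and discretization of "age"
# into labeled ranges, on invented values.
import pandas as pd

df = pd.DataFrame({"age": [19, 34, 47, 68], "annual_salary": [18000, 52000, 71000, 39000]})

# Normalization: scale salary to [0, 1] so it is comparable with other attributes.
s = df["annual_salary"]
df["salary_scaled"] = (s - s.min()) / (s.max() - s.min())

# Discretization: map continuous ages onto concept-level categories.
df["age_group"] = pd.cut(df["age"], bins=[0, 25, 60, 120], labels=["youth", "adult", "senior"])
print(df)
```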
Summary
Real-world data are often incomplete, inconsistent, and contain errors, making preprocessing
essential for high-quality analysis. By ensuring data quality through preprocessing tasks,
organizations can derive more accurate, efficient, and reliable insights for decision-making.
Data cleaning, or data cleansing, is a critical step in data preprocessing that addresses incomplete,
noisy, or inconsistent data. Here’s an overview of its main concepts:
Missing data values can hinder analysis, and there are several methods for managing them:
1. Ignoring the Tuple: Useful when many attributes are missing, though this discards potentially useful information.
2. Filling in the Value Manually: Feasible only for small amounts of missing data, as it is time-consuming and does not scale to large datasets.
3. Global Constant Replacement: Missing values can be filled with a constant like "Unknown,"
though it may skew analysis by introducing artificial patterns.
4. Using Central Tendencies (Mean or Median): Based on data distribution, mean or median
values can fill gaps, assuming central tendencies are suitable replacements.
5. Class-Specific Mean/Median: For attributes within a particular class, the class mean or
median may be a more accurate fill.
6. Most Probable Value Prediction: Techniques like regression or decision trees can predict
missing values based on existing data patterns.
Some methods may introduce biases, but using related attributes often yields better estimates,
preserving attribute relationships.
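The hedged pandas sketch below illustrates three of the strategies above (global constant, overall mean, and class-specific mean) on an invented customer table; column names and values are assumptions.

```python
# Filling missing values: global constant, overall mean, and class-specific mean,
# illustrated on an invented customer table.
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "segment": ["high", "high", "low", "low"],
    "income":  [90000, np.nan, 30000, np.nan],
    "city":    ["Boston", None, "Austin", "Denver"],
})

df["city"] = df["city"].fillna("Unknown")                        # global constant
df["income_mean"] = df["income"].fillna(df["income"].mean())     # overall mean
df["income_by_class"] = df["income"].fillna(                     # class-specific mean
    df.groupby("segment")["income"].transform("mean"))
print(df)
```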
Noise refers to random errors or variances in data. To manage this, several smoothing techniques are
applied:
1. Binning: Divides sorted data into bins. Smoothing can be done by bin means, medians, or
boundaries, which smooth data based on neighboring values.
2. Regression: Fits data to a function, like a line (linear regression) or a surface (multiple
regression), based on relationships among attributes.
3. Outlier Analysis: Clustering can identify outliers, which may represent noise.
These methods also reduce distinct attribute values, aiding in data discretization.
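A minimal sketch of smoothing by bin means on a small invented price list, using equal-frequency bins:

```python
# Smoothing by bin means: partition sorted invented prices into equal-frequency
# bins and replace each value with its bin's mean.
prices = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])
bin_size = 3

smoothed = []
for i in range(0, len(prices), bin_size):
    bin_values = prices[i:i + bin_size]
    bin_mean = sum(bin_values) / len(bin_values)
    smoothed.extend([round(bin_mean, 1)] * len(bin_values))

print(smoothed)  # [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
```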
• Data Cleaning Tools: Commercial tools support the process: data scrubbing tools use simple domain knowledge to detect and correct errors, data auditing tools analyze the data to discover rule violations, and data migration/ETL tools allow transformations to be specified and applied.
Increasingly, data cleaning tools emphasize interactivity. For example, Potter’s Wheel provides a step-
by-step interface, allowing real-time transformation checks. Declarative languages for data
transformation are also emerging, enhancing interactivity and efficiency in data cleaning processes.
These strategies ensure data reliability, though the process can be iterative and time-consuming,
requiring updates to metadata to ease future data cleaning.
Data Integration
Data mining often requires data integration, which is the process of merging data from multiple
sources. Effective integration helps reduce redundancies and inconsistencies in the resulting dataset,
thereby improving the accuracy and efficiency of subsequent data mining tasks.
• Schema Integration: Aligning the data structures and formats from different sources.
Metadata plays a crucial role in schema integration and data cleaning, helping to identify and
transform attributes accurately. For instance, different coding schemes for payment types may need
to be standardized for effective integration.
During data integration, it is essential to accurately match equivalent entities from various data
sources, such as databases, data cubes, or flat files. This process involves careful consideration of:
• Attribute Metadata: Attributes may have different names, meanings, data types, and
permissible value ranges. Understanding this metadata helps avoid errors during integration.
• Functional Dependencies and Referential Constraints: Ensuring that these constraints are
preserved across different systems is critical. For example, the way discounts are applied (to
the order versus individual line items) must be consistent to avoid errors in the target
system.
Redundancies may arise when one attribute can be derived from others. To detect redundancy,
correlation analysis can be employed:
• Nominal Data: For nominal attributes, a chi-square test can determine the correlation
between two attributes. This test assesses the independence of two categorical variables by
comparing observed frequencies against expected frequencies derived from their
distribution.
• Numeric Data: For numeric attributes, the correlation coefficient (Pearson’s r) measures the
strength of the relationship between two variables. A high positive or negative value
indicates a strong correlation, suggesting potential redundancy.
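A hedged sketch of both redundancy checks with SciPy follows; the contingency counts and numeric values are invented for illustration.

```python
# Redundancy checks: chi-square test for two nominal attributes and Pearson's r
# for two numeric attributes, both on invented data.
import numpy as np
from scipy.stats import chi2_contingency, pearsonr

# Nominal: observed counts for two categorical attributes (invented numbers).
observed = np.array([[250, 200],
                     [ 50, 1000]])
chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"chi2={chi2:.1f}, p={p_value:.3g}")  # a small p suggests the attributes are correlated

# Numeric: yearly sales vs. advertising spend (invented values).
sales = np.array([100, 120, 150, 170, 200], dtype=float)
ads   = np.array([10, 13, 15, 18, 21], dtype=float)
r, p = pearsonr(sales, ads)
print(f"Pearson r={r:.2f}")                  # r close to 1 suggests possible redundancy
```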
Tuple duplication is another critical issue in data integration. It occurs when two or more identical (or nearly identical) records exist for the same real-world entry. Common causes include:
• Denormalized Tables: Used to enhance performance, they may lead to redundancies if not
managed properly.
• Data Entry Errors: Inconsistent updates can leave duplicate records that disagree with each other, such as different addresses recorded for the same purchaser in a database.
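A small pandas sketch for removing exact duplicate tuples and surfacing conflicting near-duplicates (same purchaser, different addresses), on invented records:

```python
# Detect exact duplicate tuples and inspect conflicting near-duplicates
# (same purchaser, different address) in an invented table.
import pandas as pd

orders = pd.DataFrame({
    "purchaser": ["M. Singh", "M. Singh", "M. Singh", "J. Ortiz"],
    "address":   ["12 Elm St", "12 Elm St", "98 Oak Ave", "5 Pine Rd"],
    "amount":    [40.0, 40.0, 40.0, 75.0],
})

deduplicated = orders.drop_duplicates()            # removes the exact duplicate row
conflicts = deduplicated[deduplicated.duplicated("purchaser", keep=False)]
print(conflicts)   # same purchaser listed with two different addresses
```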
Data integration must also address conflicts in data values, which can occur due to differences in
representation or encoding between data sources. Common examples include:
• Unit Differences: A weight attribute may be recorded in metric units in one source and
imperial units in another.
• Grading Systems: Different educational institutions may use varying grading schemes,
complicating data exchanges.
• Abstraction Levels: Attributes recorded at different levels of detail can lead to confusion,
such as total sales figures representing different scopes across systems.
By employing effective detection and resolution strategies, data integration can be streamlined,
leading to more reliable datasets for analysis.
Conclusion
Effective data integration is crucial for successful data mining. It requires careful consideration of
schema matching, entity identification, redundancy analysis, and conflict resolution to create a
coherent and accurate dataset from multiple sources.