DMT Unit 1

The document provides an overview of data mining, detailing types of data such as database data, data warehouses, and transactional data, along with their structures, purposes, and mining techniques. It discusses functionalities such as characterization, classification, clustering, and outlier analysis, as well as technologies used in data mining, including statistics, machine learning, and database systems. It also highlights applications in business intelligence and web search engines, and addresses major issues in the field such as methodology, user interaction, efficiency, and data diversity.

Data Mining

1. Database Data

• Stored in: Relational databases (tables with rows and columns).

• Examples: Customer records, employee data, item listings.

• Structure: Uses a schema (e.g., customer(custID, name, age, income...)).

• Tools: SQL for querying and aggregating data.

• Data Mining Use: Identify trends, patterns, and deviations (e.g., predicting
credit risk, analyzing sales).

2. Data Warehouses

• Stored in: Centralized repositories integrating data from multiple sources.

• Structure: Multidimensional (data cubes).

• Purpose: Historical analysis and business decision support.

• Features:

o Organized by subject (customer, item, time, etc.)

o Uses OLAP operations like roll-up and drill-down.

• Data Mining Use: Discover patterns at various levels of granularity, enable


exploratory analysis.

3. Transactional Data

• Stored in: Flat files or tables representing individual transactions.

• Structure: Each record contains a transaction ID and items involved (e.g., T100:
I1, I3, I8).

• Examples: Retail sales, flight bookings, clickstream data.

• Data Mining Use:

o Market basket analysis

o Frequent itemset mining to discover what products are often bought


together.

Other Forms of Data (Mentioned briefly)

• Data Streams
• Sequence Data

• Graph or Network Data

• Spatial, Text, Multimedia Data

• Web Data (WWW)

Comparison of database data, data warehouse data, and transactional data:

| Feature | Database Data | Data Warehouse Data | Transactional Data |
|---|---|---|---|
| Definition | Structured data stored in relational databases (RDBMS) using tables. | Integrated, historical data from multiple sources stored for analytical purposes. | Data representing real-world transactions or events, often sequential and time-stamped. |
| Purpose | Real-time operations and day-to-day transactions (OLTP). | Strategic decision support and historical analysis (OLAP). | Capturing and analyzing individual user or business actions (e.g., purchases). |
| Structure | Tables (relations) with rows (tuples) and columns (attributes). | Multidimensional data cubes with summarized information. | Flat files or nested tables with transaction IDs and lists of items or events. |
| Storage System | Relational Database Management System (RDBMS) such as MySQL, Oracle, SQL Server. | Centralized data warehouse (e.g., Amazon Redshift, Snowflake, Google BigQuery). | Flat files, NoSQL DBs, or special transactional DB systems. |
| Schema Type | Normalized schemas (3NF or ER models). | Star or snowflake schemas for fast aggregation and querying. | Often denormalized or semi-structured (list of items per transaction). |
| Data Granularity | Fine-grained (detailed individual records). | Aggregated (summary over time or groups). | Fine-grained (detailed per transaction). |
| Data Sources | Single operational system (e.g., POS system). | Multiple heterogeneous sources (e.g., regional DBs, logs). | Point-of-sale systems, sensors, web logs, etc. |
| Update Frequency | Frequently updated (daily, hourly). | Periodically updated (daily, weekly, monthly). | Continuously updated or appended. |
| Examples | customer(custID, name, age, income); item(itemID, price, category) | Sales by region and time; quarterly product performance | T100: [I1, I3, I8]; clickstream log: UserID: [Page1, Page3, Page7] |
| Query Type | SQL-based queries: SELECT, JOIN, GROUP BY, etc. | OLAP queries: drill-down, roll-up, slice, dice | Pattern mining: association rule mining, sequence pattern mining |
| Mining Techniques | Classification, clustering, outlier detection, regression | Multidimensional pattern mining, trend analysis, anomaly detection | Association rule mining, sequential pattern mining, market basket analysis |
| Use Cases | Predicting customer churn, identifying fraud, customer segmentation. | Strategic decisions such as finding underperforming regions or analyzing product trends. | Recommending products, detecting buying behavior, promotion bundling. |
| Tools/Technologies | SQL, DBMS (MySQL, PostgreSQL), Python (pandas) | OLAP tools (Tableau, Power BI), ETL pipelines, cube computation | Apache Hadoop, Spark, NoSQL, association rule mining (Apriori, FP-Growth) |

Functionalities

1. Characterization and Discrimination (Class/Concept Description)

➤ Characterization:

Describes the general features of data belonging to a target class.


• Provides a concise summary, usually through descriptive statistics, OLAP
operations, or attribute-oriented induction.

• Example: A retail manager wants to know the profile of customers who spend
over $5000/year. The result might show that they are typically middle-aged,
employed, and have good credit ratings.

• Output can be presented in the form of:

o Charts (bar, pie)

o Generalized relations

o Characteristic rules (e.g., "If income > 50K → likely to spend > $5000")

• It's used for summarizing and understanding data patterns in a group.

➤ Discrimination:

• Compares the features of a target class against one or more contrasting


classes.

• Example: Comparing customers who shop frequently for computer products vs.
those who shop rarely. Differences may include age, education, etc.

• Helps identify features that distinguish between groups (e.g., age, occupation).

• Often results in discriminant rules, e.g., “If age between 20-40 and education =
university → frequent buyer.”

Key Difference: Characterization is about describing one group; discrimination is


about comparing multiple groups.

2. Mining Frequent Patterns, Associations, and Correlations

• Aims to find repetitive patterns, associations, or correlations in large


datasets.

• This includes:

o Frequent Itemsets: Sets of items that often appear together in


transactions (e.g., bread and butter).

o Sequential Patterns: Items purchased in a sequence (e.g., laptop →


camera → memory card).

o Substructures: Patterns in structural forms like graphs or trees.

➤ Association Rule Mining:


• Example: “buys(X, 'computer') → buys(X, 'software') [support: 1%, confidence:
50%]”

o Means 1% of transactions include both, and 50% of computer buyers also


buy software.

• Can be single-dimensional (same predicate) or multi-dimensional (age,


income, buys).

• Used in market basket analysis, cross-selling, product recommendations.

➤ Correlation Analysis:

• Goes beyond co-occurrence to measure statistical significance between items


(e.g., chi-square test).

Importance: Helps in identifying what tends to happen together in data, enabling


targeted marketing, inventory planning, and more.

3. Classification and Regression (Predictive Analysis)

➤ Classification:

• Builds a model (classifier) that assigns data to predefined categories or


classes.

• Requires labeled training data.

• Output can be:

o IF-THEN rules

o Decision trees

o Neural networks

o SVMs, k-NN, Bayesian classifiers

• Example: Classifying items based on sales response (good, mild, none).

• Used in spam detection, credit scoring, disease diagnosis, etc.

➤ Regression:

• Predicts continuous numeric values, not categories.

• Example: Predicting the expected revenue from a product.

• Methods include linear regression, polynomial regression, and advanced ML


techniques.
• Used in forecasting, pricing models, stock prediction.

Key Difference: Classification predicts discrete labels, regression predicts


continuous values.
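To make the distinction concrete, the sketch below fits a classifier on discrete labels and a regressor on continuous targets. It assumes scikit-learn is installed; the tiny feature arrays and target values are invented purely for illustration.

```python
# Classification predicts a discrete label; regression predicts a number.
# Assumes scikit-learn; the data below is made up for illustration only.
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LinearRegression

# Classification: predict a sales-response class from [age, income].
X_cls = [[25, 30000], [40, 80000], [35, 52000], [50, 90000]]
y_cls = ["none", "good", "good", "good"]
clf = DecisionTreeClassifier().fit(X_cls, y_cls)
print(clf.predict([[45, 70000]]))      # -> ['good'], a discrete class label

# Regression: predict expected revenue from advertising spend.
X_reg = [[100], [200], [300], [400]]
y_reg = [1100.0, 1900.0, 3100.0, 3900.0]
reg = LinearRegression().fit(X_reg, y_reg)
print(reg.predict([[250]]))            # -> a continuous estimate (about 2500)
```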

4. Cluster Analysis

• Groups a set of objects into clusters so that:

o Intra-cluster similarity is high

o Inter-cluster similarity is low

• No prior labels are required (unsupervised learning).

• Each cluster can later be treated as a class for further analysis.

• Example: Segmenting customers based on purchasing behavior or geographic


location.

• Visualized often using 2D/3D plots (e.g., k-means, DBSCAN).

• Applications include customer segmentation, image recognition, bioinformatics.

Purpose: To discover natural groupings within data without predefined categories.
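A minimal k-means sketch (one of the algorithms named above), assuming scikit-learn; the two-dimensional customer features are invented for illustration.

```python
# Unsupervised clustering: no labels are supplied, the algorithm discovers groups.
# Assumes scikit-learn; the feature values are invented.
import numpy as np
from sklearn.cluster import KMeans

# Invented customer features: [annual spend, visits per month]
X = np.array([[200, 2], [220, 3], [800, 6], [820, 7], [1500, 12], [1600, 14]])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.labels_)           # cluster assignment for each customer
print(km.cluster_centers_)  # centroid of each discovered segment
```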


5. Outlier Analysis

• Identifies data objects that deviate significantly from the general pattern.

• Such data points are called outliers or anomalies.

• Useful in applications where rare events are more important than common
ones:

o Fraud detection (e.g., unusual credit card activity)

o Intrusion detection

o Medical anomalies

• Techniques:

o Statistical methods (assuming distribution models)

o Distance-based methods (objects far from others)

o Density-based methods (like LOF – Local Outlier Factor)

• Not all outliers are noise—many are insightful and can drive important
decisions.
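As a small sketch of the statistical approach listed above, the snippet below flags values whose z-score exceeds a threshold; the charge amounts and the threshold are assumptions chosen for this tiny example.

```python
# Statistical outlier detection: flag values far from the mean in units of
# standard deviation (z-score). Data and threshold are illustrative assumptions.
import numpy as np

amounts = np.array([42.0, 38.5, 40.1, 45.0, 39.9, 41.2, 980.0])  # invented card charges
z = (amounts - amounts.mean()) / amounts.std()

threshold = 2.0                         # with only 7 points a z-score cannot reach 3
print(amounts[np.abs(z) > threshold])   # -> [980.]
```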

Technologies Used in Data Mining


Data mining is an application-driven field that integrates various techniques from
multiple disciplines to extract valuable insights from data. These include:

1. Statistics

o Role in Data Mining: Statistics is used to model data and target classes.
A statistical model describes the behavior of data using mathematical
functions and probability distributions.

o Applications in Data Mining:

▪ Data Characterization and Classification: Statistical models can


be used to classify and characterize data.

▪ Handling Noise and Missing Data: Statistics helps in modeling


and handling noisy or missing data during the data mining process.
▪ Prediction and Forecasting: Statistical models are key for
prediction tasks, providing a framework for making inferences
about the data.

▪ Verifying Data Mining Results: After building classification or


prediction models, statistical hypothesis testing helps verify their
accuracy and significance.

o Challenges: Scaling statistical methods for large datasets is complex


due to computational costs. This issue is exacerbated for online
applications requiring real-time processing.

2. Machine Learning

o Role in Data Mining: Machine learning focuses on enabling computers to


learn patterns and make decisions based on data. Machine learning is
used in data mining for tasks like classification and clustering.

o Types of Learning Methods:

▪ Supervised Learning (Classification): Involves training a model


with labeled data to recognize patterns, such as recognizing postal
codes from handwritten images.

▪ Unsupervised Learning (Clustering): The model learns from data


without labels, finding hidden patterns or groups (e.g., recognizing
different digits in handwritten data without predefined labels).

▪ Semi-supervised Learning: Combines both labeled and


unlabeled data. Labeled data helps build models, while unlabeled
data helps refine the model's boundaries, improving accuracy.

▪ Active Learning: The model actively queries humans (domain


experts) to label uncertain data points, thus improving the model
with minimal human input.

o Challenges: While machine learning focuses on accuracy, data mining


also emphasizes efficiency, scalability, and handling diverse types of
data.

3. Database Systems and Data Warehouses

o Role in Data Mining: Database systems handle the storage,


management, and retrieval of data, and they play a crucial role in
ensuring that data mining can scale to large datasets.

o Data Warehousing: A data warehouse integrates data from various


sources and timeframes into a unified structure. It enables advanced
data analysis by consolidating data into multidimensional space, known
as data cubes.

o Data Mining Integration: Modern database systems often incorporate


data mining capabilities to extend their analytic power. Data mining tools
can operate directly on data stored in databases to identify patterns.

o Challenges: Data mining often involves working with real-time streaming


data, which requires efficient database technologies to process large
volumes of data quickly.

4. Information Retrieval (IR)

o Role in Data Mining: Information retrieval involves searching and


retrieving relevant documents or information from a large database or the
web. Unlike database systems, IR deals with unstructured data (e.g., text
or multimedia).

o Probabilistic Models: IR uses probabilistic models to measure the


similarity between documents. Text documents are often represented as
a bag of words, where the presence and frequency of words are
important, but word order is not.

o Topic Modeling: IR systems use models to identify underlying topics in


collections of documents. These topics are represented as probability
distributions over a vocabulary, and documents may belong to multiple
topics.

o Integration with Data Mining: Combining IR with data mining techniques


enables deeper analysis of text and multimedia data, facilitating better
search and analysis of large, unstructured datasets (e.g., web data, digital
libraries, healthcare records).

Applications of Data Mining


Data mining plays a vital role in various fields where large amounts of data need to be
analyzed. Here are two major applications:

1. Business Intelligence (BI)

o Purpose: To understand business contexts such as customers, market


trends, and competitors.
o Key Techniques:

▪ Classification and Prediction for sales, market analysis, and


customer feedback.

▪ Clustering for Customer Relationship Management (CRM),


grouping customers by similarities.

▪ Characterization mining for understanding customer groups and


developing tailored programs.

o Importance: BI allows businesses to make smart decisions, retain


valuable customers, and gain insights into competitors. Without data
mining, effective market analysis would be difficult.

2. Web Search Engines

o Purpose: To retrieve information from the web in response to user


queries.

o Techniques Used:

▪ Crawling: Deciding which web pages to crawl and how frequently.

▪ Indexing: Choosing which pages to index and how to structure the


index.

▪ Ranking: Determining how to rank pages based on relevance and


quality.

o Challenges:

▪ Data Volume: Search engines deal with massive amounts of data,


requiring cloud computing for processing.

▪ Real-Time Processing: Search engines need to respond to user


queries instantly, often requiring continuous updates and real-
time data mining.

▪ Small Data Issues: Many queries are asked rarely, posing a


challenge for mining methods designed for large datasets.

Major Issues in Data Mining


Data mining, being a rapidly evolving field, faces several challenges and open research
areas. These challenges can be categorized into five main groups:

1. Mining Methodology
o New Knowledge Types: Data mining covers a broad range of tasks (e.g.,
classification, regression, clustering), and as new applications emerge,
new mining techniques are developed.

o Multidimensional Data Mining: Mining knowledge across different


dimensions, such as combining various attributes in data cubes.

o Interdisciplinary Approaches: Integrating methods from natural


language processing, software engineering, and other fields enhances
data mining.

o Handling Uncertainty: Dealing with noisy or incomplete data is a


significant challenge in data mining, requiring techniques like data
cleaning and outlier detection.

2. User Interaction

o Interactive Mining: The mining process should be flexible and dynamic,


allowing users to refine searches and explore data interactively.

o Incorporation of Background Knowledge: Including domain-specific


knowledge, constraints, or rules can guide the mining process towards
more useful results.

o Data Mining Query Languages: High-level languages or interfaces allow


users to define and optimize ad hoc queries, making the process more
user-friendly.

o Visualization of Results: Presenting mining results in an understandable,


visually intuitive way is crucial for the usability of data mining systems.

3. Efficiency and Scalability

o Algorithm Efficiency: Data mining algorithms need to handle large


datasets quickly and efficiently, especially as data volumes grow.

o Parallel and Distributed Mining: Large datasets often require parallel


processing across distributed systems. Cloud and cluster computing are
common methods to scale data mining processes.

o Incremental Mining: Incremental algorithms that can update models as


new data arrives without reprocessing all existing data are a key area of
research.

4. Diversity of Data Types

o Complex Data Types: Data mining must handle a variety of data types,
from structured databases to unstructured data like text and images.
o Dynamic Data: Some data, like online data streams or real-time sensor
data, change constantly, presenting challenges for traditional mining
methods.

o Interconnected Data: Many datasets are linked (e.g., social networks,


web data), requiring mining techniques that can handle and exploit these
connections.

5. Data Mining and Society

o Social Impact: Data mining affects privacy, security, and social


dynamics. How can we use data mining for societal benefit while
preventing misuse?

o Privacy-Preserving Mining: Safeguarding individuals' privacy while


conducting data mining is crucial. Ongoing research focuses on privacy-
preserving data mining methods.

o Invisible Data Mining: Many systems perform data mining behind the
scenes without users' awareness. For instance, e-commerce sites track
user behavior to recommend products.

What is a Data Warehouse?


A data warehouse refers to a central repository where data from different sources is
stored and organized for analysis and decision-making. It allows businesses to store
historical data that supports strategic decisions. Data warehouses are essential in
today’s competitive world as organizations use them to gain insights into various
aspects of their operations and make informed decisions.

Key Features of a Data Warehouse

1. Subject-Oriented:

o Data warehouses are designed around major subjects of interest such as


customers, products, suppliers, and sales.

o Unlike operational databases, which focus on day-to-day transactions,


data warehouses are structured to provide a more analytical view that
helps decision-makers.

o Data is organized to reflect decision support processes, not operational


activities.
2. Integrated:

o A data warehouse integrates data from multiple heterogeneous sources


like relational databases, flat files, and transaction logs.

o It ensures consistency in naming conventions, data formats, and attribute


measures.

o The data is cleaned and standardized before being loaded into the
warehouse.

3. Time-Variant:

o Data in a warehouse is typically historical, covering several years (e.g., 5-


10 years) to help analyze trends over time.

o Each data set within a warehouse includes a time element, either


implicitly or explicitly, to track changes and trends over time.

4. Nonvolatile:

o Once data is stored in a warehouse, it is not changed. New data is only


appended to the system.

o A data warehouse does not require the mechanisms for transaction


processing, recovery, or concurrency control that are needed in
operational databases.

o The primary operations in a data warehouse are data loading and


querying.

Functions of a Data Warehouse

• A data warehouse consolidates large amounts of data for analysis and decision-
making purposes. It is not primarily designed for transactional operations but
rather for answering complex queries and providing insights into various
business activities.

• A data warehouse is often constructed by integrating data from multiple sources,


using processes like data cleaning, data integration, and data consolidation.

• Decision support technologies are used to query the data, generate reports,
and make strategic decisions based on the insights derived from the data.
Difference between a Database (OLTP) and a Data Warehouse (OLAP)
| Feature | OLTP (Operational Database System) | OLAP (Data Warehouse System) |
|---|---|---|
| Purpose | Handles day-to-day transactions and query processing | Supports data analysis and decision making |
| Users | Clerks, clients, IT professionals (customer-oriented) | Managers, executives, analysts (market-oriented) |
| Data Content | Current, detailed, real-time transactional data | Historical, aggregated, summarized data |
| Data Volume | Typically smaller; focused on current data | Very large; includes years of historical data |
| Database Design | Entity-Relationship (ER) model; application-oriented | Star or snowflake schema; subject-oriented |
| View of Data | Narrow view, specific to a department or enterprise | Broad view, integrates data from multiple sources |
| Query Characteristics | Simple, short, atomic queries and transactions | Complex queries, often involving aggregation |
| Operations | Frequent inserts, updates, deletes (write-heavy) | Primarily read-only operations (read-heavy) |
| Concurrency Control | Requires concurrency control and recovery mechanisms | Less need for concurrency control, due to read-only nature |
| Access Patterns | High transaction throughput, quick response time | Low latency not critical; focus on complex query performance |
| Performance Metrics | Measured by number of transactions per second | Measured by query response time and analytical capability |
| Data Granularity | Very detailed, fine-grained data | Data stored at multiple levels of granularity (from detailed to summarized) |
| Frequency of Access | Constant, very frequent | Periodic, depending on analysis/reporting needs |
| Data Integration | Minimal; single source systems | High; integrates data from multiple heterogeneous sources |
| Storage Medium | Typically stored on a single system or server | Distributed storage systems due to large volume |

Data Warehousing: A Multitiered Architecture


Data warehouses are designed using a three-tier architecture, which helps in
separating data storage, data processing, and data presentation. This architecture
ensures scalability, flexibility, and efficient data management. Here's a breakdown of
each tier:

1. Bottom Tier: Data Warehouse Server

• Role: This tier is responsible for storing the actual data.

• Technology Used: Usually a relational database management system


(RDBMS).

• Functions:

o Data Extraction: Pulls data from various operational and external


sources.

o Data Cleaning: Removes errors, inconsistencies, and duplicates.

o Data Transformation: Converts data into a common, unified format.

o Data Loading: Transfers the processed data into the data warehouse.

o Data Refreshing: Periodically updates the data warehouse to reflect


recent changes.

• Data Sources:

o Operational Databases: e.g., banking systems, sales systems.

o External Sources: e.g., market research reports, customer profiles from


third parties.

• Tools Used:

o Gateways (APIs) to connect and query the source systems:

▪ ODBC (Open Database Connectivity)


▪ OLEDB (Object Linking and Embedding Database)

▪ JDBC (Java Database Connectivity)

• Metadata Repository:

o Stores information about data (like source, format, transformations


applied).

o Acts as a directory for warehouse management and query optimization.

2. Middle Tier: OLAP Server

• Role: Acts as the processing layer, converting data into a form suitable for
analysis.

• Two Main OLAP Models:

1. ROLAP (Relational OLAP):

▪ Works on top of relational databases.

▪ Converts multidimensional operations into relational queries.

▪ Suitable for handling large volumes of data.

2. MOLAP (Multidimensional OLAP):

▪ Uses specialized multidimensional data structures (cubes).

▪ Faster for complex analytical queries, but may have storage


limitations.

• Functionality:

o Supports advanced analytical processing, including summarization,


aggregation, and complex computations.

o Optimized for read-heavy operations.

3. Top Tier: Front-End Tools

• Role: This is the user interface layer, where users interact with the system.

• Components:

o Query and Reporting Tools: For generating standard or custom reports.


o Data Analysis Tools: For ad-hoc querying, slicing, dicing, and drill-down
analysis.

o Data Mining Tools: For predictive modeling, clustering, trend analysis,


etc.

• Users:

o Business Analysts

o Executives and Managers

o Decision Makers

• Functionality:

o Provides a visual and interactive environment for exploring and


analyzing data.

o Supports dashboards, charts, graphs, and other visualizations.


Data Warehouse Models: Enterprise Warehouse,
Data Mart, and Virtual Warehouse
1. Enterprise Data Warehouse (EDW)

• Definition: A centralized data warehouse that stores information from across the
entire organization.

• Scope: Corporate-wide, cross-functional.

• Data:

o Includes both detailed and summarized data.

o Integrated from multiple operational systems or external sources.

• Implementation:

o Requires extensive business modeling.

o Typically built on mainframes, superservers, or parallel systems.

o May take months or years to design and deploy.

• Advantages:

o Single source of truth.

o High consistency and integration.

• Disadvantages:

o Time-consuming and expensive to build.

o Inflexible in dynamic environments.

2. Data Mart

• Definition: A smaller, focused version of a data warehouse that stores data for a
specific business line or department (e.g., marketing, sales).

• Scope: Departmental or subject-specific.

• Data:

o Typically summarized and related to specific business needs.

• Types:
o Independent Data Mart: Sourced directly from operational systems or
external providers.

o Dependent Data Mart: Sourced from an existing enterprise data


warehouse.

• Implementation:

o Uses low-cost servers (e.g., Linux, Windows).

o Takes weeks to build (faster ROI).

• Advantages:

o Quick to implement.

o Cost-effective.

o Flexible and adaptable to specific needs.

• Disadvantages:

o Risk of data silos.

o Complex integration later if not aligned with enterprise strategy.

3. Virtual Warehouse

• Definition: A set of virtual views over operational databases.

• Implementation:

o Does not store data physically.

o Queries are processed in real time using views.

• Advantages:

o Easy and fast to build.

o Cost-efficient (no extra storage).

• Disadvantages:

o Performance depends on operational systems.

o Requires high processing capacity for complex queries.

o Limited historical data analysis.


Top-Down vs. Bottom-Up Approaches to Data Warehouse Development

| Aspect | Top-Down Approach | Bottom-Up Approach |
|---|---|---|
| Start Point | Begins with an enterprise data warehouse | Starts with departmental data marts |
| Time & Cost | High cost and long duration | Low cost, faster implementation |
| Flexibility | Less flexible | More adaptable |
| Integration | Minimizes integration issues later | May lead to integration challenges |
| Suitability | Best for long-term strategic planning | Best for tactical and quick solutions |
| Risk | High initial investment with late returns | Quick wins but may cause silo issues |

Recommended Approach: Incremental & Evolutionary

A hybrid approach is often best — combining top-down planning with bottom-up


implementation. The steps are:

1. Define High-Level Corporate Data Model

o Done within 1–2 months.

o Ensures consistent view of data across the organization.

2. Implement Independent Data Marts

o Developed in parallel using the high-level model.

o Quick deployment, department-level use.

3. Construct Distributed Data Marts

o Integrate various marts via hub servers.

o Enables data sharing across business units.

4. Build Multitier Data Warehouse

o Centralized Enterprise Data Warehouse becomes the primary data


store.

o Distributes data to dependent data marts as needed.


Data Warehouse Modeling: Data Cube and OLAP
Overview

• Data Warehouses and OLAP (Online Analytical Processing) tools are built on
the multidimensional data model.

• This model visualizes data as a data cube, which allows for interactive analysis
of multidimensional data.

• The modeling supports advanced operations like roll-up, drill-down, and


slicing/dicing to enable deep business insights.

Data Cube: A Multidimensional Data Model


What is a Data Cube?

• A data cube allows data to be modeled in n-dimensions (not just 3D).

• It is defined by:

o Dimensions: The perspectives for analysis (e.g., time, item, location).

o Facts/Measures: Quantitative data (e.g., dollars sold, units sold).


Key Concepts:

• Dimensions:

o Examples: time, item, branch, location.

o Each has a dimension table (e.g., for item: item name, brand, type).

• Fact Table:

o Contains numeric measures like dollars sold, units sold, etc.

o Links to each dimension via foreign keys.

Representation:

• 2-D Cube: Like a spreadsheet/table (e.g., time × item, for location = Vancouver).

• 3-D Cube: time × item × location.

• 4-D Cube: time × item × location × supplier — hard to visualize but conceptually
a series of 3D cubes.

Cuboids and Lattice

What is a Cuboid?

• A cuboid is a cube at a certain level of summarization (group-by).

• Base Cuboid: The lowest level (e.g., time, item, location, supplier).

• Apex Cuboid (0-D): The highest level — summarized over all dimensions.

Data Cube Lattice

• Given n dimensions, 2^n possible cuboids exist.

• Forms a lattice structure, representing all possible levels of summarization.

Example from the notes:

For dimensions: time, item, location, supplier, the lattice includes:

• 0-D Cuboid: total sales (summarized across all dimensions)

• 1-D Cuboids: {time}, {item}, {location}, {supplier}

• 2-D Cuboids: {time, item}, {item, supplier}, etc.

• 3-D Cuboids: {time, item, location}, etc.

• 4-D Cuboid: {time, item, location, supplier} (base cuboid)
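The 2^n count of cuboids can be checked by enumerating every group-by subset of the dimensions; the short sketch below does this for the four dimensions of the example.

```python
# Enumerate all cuboids (group-by subsets) of n dimensions: 2**n in total.
from itertools import combinations

dimensions = ["time", "item", "location", "supplier"]

cuboids = [c for k in range(len(dimensions) + 1)
             for c in combinations(dimensions, k)]

print(len(cuboids))   # 2**4 = 16
print(cuboids[0])     # ()  -> apex cuboid (everything summarized away)
print(cuboids[-1])    # ('time', 'item', 'location', 'supplier') -> base cuboid
```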


Multidimensional Schema Models
1. Star Schema

• Structure:

o Central fact table (large, non-redundant).

o Connected dimension tables (flat, possibly redundant).

• Pros:

o Simple, fast query performance.

o Easy to understand.

• Cons:

o Some redundancy in dimension tables.

• Use case: Most common in data marts.

2. Snowflake Schema

• Structure:

o Like a star schema but dimension tables are normalized into sub-tables.

• Pros:

o Reduces redundancy.

o Easier maintenance.

• Cons:

o More complex queries due to joins.


o Slight performance trade-off.

• Use case: Less common; used when storage efficiency is more critical.

3. Fact Constellation (Galaxy Schema)

• Structure:

o Multiple fact tables sharing dimension tables.

• Pros:

o Models multiple interrelated subjects.

o Captures enterprise-wide data.

• Cons:

o Complex structure.

• Use case: Suitable for enterprise data warehouses.


Concept Hierarchies for Dimensions
What is a Concept Hierarchy?

• Maps low-level values (e.g., city) to higher-level concepts (e.g., country).

• Helps summarize or roll-up data in OLAP operations.

Types:

• Schema Hierarchy: Total/partial order (e.g., street < city < province < country).

• Lattice: Partial order where attributes don’t follow a single path (e.g., week <
year; day < month).

• Set-grouping Hierarchy: Value ranges grouped (e.g., price ranges: ($0–$200],


($200–$400], ...).

• Can be manually defined or automatically generated.

Measures in a Data Cube

Measures = Numeric values aggregated over dimension values (e.g., total sales).

| Type | Definition | Examples | Efficiency |
|---|---|---|---|
| Distributive | Can be computed from subaggregates and combined. | sum(), count(), min(), max() | Very efficient |
| Algebraic | Computed using a fixed number of distributive aggregates. | avg() = sum()/count(), stddev() | Efficient |
| Holistic | Requires a full data scan; cannot be broken into subaggregates. | median(), mode(), rank() | Inefficient |

Most OLAP tools focus on distributive and algebraic measures for performance.
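The difference shows up when data is partitioned across nodes: per-partition results of a distributive measure combine into the exact global answer, while a holistic measure such as the median does not. The partitions below are invented for illustration.

```python
# Distributive vs. holistic measures over a partitioned dataset (invented values).
import statistics

partitions = [[4, 8, 15], [16, 23], [42, 7, 1]]
all_values = [v for p in partitions for v in p]

# Distributive: combining per-partition sums gives the exact global sum.
print(sum(sum(p) for p in partitions), "==", sum(all_values))            # 116 == 116

# Holistic: the median of per-partition medians is NOT the global median,
# so the measure needs access to the full data.
per_part = [statistics.median(p) for p in partitions]                    # [8, 19.5, 7]
print(statistics.median(per_part), "!=", statistics.median(all_values))  # 8 != 11.5
```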
OLAP Operations
1. Roll-Up

• Definition: Aggregates data by climbing up a concept hierarchy or by reducing


dimensions.

• Example: Aggregating sales data from city to country (Toronto → Canada).

• Also called: Drill-Up (by some vendors).

2. Drill-Down

• Definition: The reverse of roll-up; navigates from summary data to more detailed
data.

• Example: Moving from quarterly sales data to monthly sales data.

• Also includes: Adding a new dimension (e.g., customer group) for more detail.

3. Slice

• Definition: Selects a single dimension value, resulting in a subcube.

• Example: Selecting data where time = Q1 only.

4. Dice

• Definition: Selects a range of values on two or more dimensions, resulting in a


subcube.

• Example: Data for location = Toronto or Vancouver, time = Q1 or Q2, and item
= home entertainment or computer.

5. Pivot (Rotate)

• Definition: Rotates the cube to view data from different perspectives.

• Example: Swapping the axes item and location for alternate visual layout.
6. Drill-Across

• Definition: Executes queries across multiple fact tables.

7. Drill-Through

• Definition: Accesses the bottom-level data in the data cube using SQL, typically
reaching into backend relational tables.

8. Other Advanced Operations

• Examples:

o Top-N/Bottom-N ranking.

o Moving averages, growth rates, depreciation.

o Currency conversion, internal return rates.

o Forecasting, trend/statistical analysis, variance calculations.
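The core operations can be imitated on a small pandas DataFrame, with groupby and pivot_table standing in for an OLAP engine; the sales rows below are invented for illustration.

```python
# Imitating roll-up, drill-down, slice, dice, and pivot with pandas.
import pandas as pd

sales = pd.DataFrame({
    "city":    ["Toronto", "Toronto", "Vancouver", "Vancouver"],
    "country": ["Canada",  "Canada",  "Canada",    "Canada"],
    "quarter": ["Q1",      "Q2",      "Q1",        "Q2"],
    "item":    ["computer", "phone",  "computer",  "phone"],
    "dollars_sold": [1000, 800, 1200, 600],
})

# Roll-up: climb the location hierarchy from city to country.
print(sales.groupby(["country", "quarter"])["dollars_sold"].sum())

# Drill-down is the reverse: group by the finer city level again.
print(sales.groupby(["city", "quarter"])["dollars_sold"].sum())

# Slice: fix one dimension value (time = Q1).
print(sales[sales["quarter"] == "Q1"])

# Dice: restrict two or more dimensions to value sets.
print(sales[sales["city"].isin(["Toronto"]) & sales["item"].isin(["computer", "phone"])])

# Pivot: view item vs. quarter instead of city vs. quarter.
print(pd.pivot_table(sales, values="dollars_sold", index="item",
                     columns="quarter", aggfunc="sum"))
```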

Role of Concept Hierarchies in OLAP

• Enable aggregation and drilling at various levels of detail.

• Facilitate multilevel data exploration across dimensions.


Concept Hierarchies
1. Definition

• Concept hierarchy is a sequence of mappings from low-level (specific)


concepts to high-level (general) concepts.

• Purpose: Allows multilevel data abstraction.

2. Types

• Schema Hierarchies (based on database attributes):


o Example for location: Street < City < Province/State < Country

o Example for time: Day < Month < Quarter < Year

• Lattice Structure:

o Supports partial orders, e.g., Day < Month < Quarter, and Week < Year.

• Set-Grouping Hierarchies:

o Created by grouping values into ranges or categories.

o Example for price: $0–$100, $100–$200, etc.

o User-defined groups: cheap, moderate, expensive

• Multiple Hierarchies:

o A single attribute can have multiple concept hierarchies depending on the


analysis (e.g., price can be by range or by category).

3. Sources of Concept Hierarchies

• Manual: Provided by users, domain experts.

• Automatic: Generated using statistical analysis (e.g., clustering).
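A set-grouping hierarchy such as the price ranges above can be produced mechanically; the sketch below uses pandas.cut, with assumed bin edges and category labels.

```python
# Building a set-grouping hierarchy for price: raw values -> ranges -> concepts.
# The bin edges and labels are assumptions chosen for illustration.
import pandas as pd

prices = pd.Series([35, 120, 180, 260, 410, 750])

# Level 1: group raw prices into ranges such as (0, 100], (100, 200], ...
ranges = pd.cut(prices, bins=[0, 100, 200, 400, 800])

# Level 2: map the same prices into coarser user-defined concepts.
concepts = pd.cut(prices, bins=[0, 200, 400, 800],
                  labels=["cheap", "moderate", "expensive"])

print(pd.DataFrame({"price": prices, "range": ranges, "concept": concepts}))
```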

Measures: Categorization & Computation

Measure: A numerical value computed for each multidimensional point (e.g.,


sales).

Categories of Measures:

| Type | Definition | Examples | Notes |
|---|---|---|---|
| Distributive | Can be computed in parts and then aggregated. | sum(), count(), min(), max() | Easy and efficient to compute. |
| Algebraic | Computed from a fixed number (M) of distributive measures. | avg() = sum()/count(), stddev() | Depends on multiple distributive functions. |
| Holistic | Cannot be expressed with a bounded number of distributive results. | median(), mode(), rank() | Complex, may require approximation. |
OLAP Engine Capabilities

• Enables complex analytical computations.

• Supports:

o Aggregations, hierarchies, ratios

o Forecasting, trend and statistical analysis

• Provides a user-friendly, interactive environment for querying multidimensional data.

Detailed Definition of Mining Frequent Patterns

Frequent Pattern Mining is a fundamental task in data mining that involves discovering
patterns (like itemsets, sequences, or structures) that occur frequently in a
dataset. These patterns reveal relationships and associations between data
items that can be useful in decision-making, prediction, recommendation, and
classification.

Patterns
Key Concepts in Frequent Pattern Mining

1. Frequent Pattern

A frequent pattern is a set of items, subsequences, or structures that appear together


frequently in a dataset.

Examples:

• Frequent Itemset: {milk, bread} appears together in many transactions.

• Frequent Sequential Pattern: <PC → Digital Camera → Memory Card> appears in


many customer purchase histories.

• Frequent Structured Pattern: A frequently recurring subgraph in a chemical


compound dataset.

2. Itemset

A collection of one or more items. For instance, in a supermarket:

• {milk}, {milk, bread}, {bread, butter, eggs} are itemsets.


A k-itemset contains k items.

3. Support

• Support of an itemset is the proportion (or count) of transactions that contain


the itemset.

• It measures how frequently an itemset appears in the dataset.

Support(A) = (Number of transactions containing A) / (Total number of transactions)

4. Confidence

• Confidence of a rule A → B is the probability that transactions containing A also


contain B.

Confidence(A → B) = Support(A ∪ B) / Support(A)

It shows how reliable the rule is.
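A minimal sketch of these two definitions over a toy transaction list (the transactions are invented):

```python
# Computing support and confidence directly from their definitions.
# The transactions are invented for illustration.
transactions = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"bread", "eggs"},
    {"milk", "eggs"},
]

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Support(A ∪ B) / Support(A) for the rule A -> B."""
    return support(set(antecedent) | set(consequent)) / support(antecedent)

print(support({"milk", "bread"}))        # 2/4 = 0.5
print(confidence({"milk"}, {"bread"}))   # 0.5 / 0.75 ≈ 0.67
```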

5. Association Rules

An association rule is an implication of the form:

A → B [support = s%, confidence = c%]

It means that if A occurs, B is likely to occur with support s and confidence c.

6. Closed and Maximal Frequent Itemsets

• Closed Frequent Itemset: A frequent itemset that has no superset with the
same support.

• Maximal Frequent Itemset: A frequent itemset that has no frequent supersets.


Frequent Itemsets, Closed Itemsets, and Association Rules

1. Basic Definitions

➤ Itemset:

• A group of items.

• A k-itemset contains k items (e.g., {bread, milk} is a 2-itemset).

➤ Transaction (T):

• A set of items bought together.

• Identified by a unique TID.

2. Support and Confidence

➤ Support:

• Fraction of transactions that contain an itemset.

Support(A ∪ B) = (transactions containing both A and B) / (total transactions)

➤ Confidence:

• Likelihood of item B occurring given item A.

Confidence(A → B) = Support(A ∪ B) / Support(A)

3. Association Rules

• Form: A → B

• Indicates a strong relationship: "If A occurs, B is likely to occur."

Strong Rules:

• Satisfy both:

o Minimum Support (minsup)

o Minimum Confidence (minconf)


4. Frequent Itemsets

• An itemset is frequent if its support ≥ minsup.

• Support count = number of transactions containing the itemset.

5. Closed Frequent Itemsets

• An itemset is closed if no proper superset has the same support count.

• Captures complete support info.

• Used to eliminate redundancy.

6. Maximal Frequent Itemsets

• An itemset is maximal frequent if it is frequent and none of its supersets are


frequent.

• Represents the outer boundary of frequent itemsets.

• More compact, but may lose support details of subsets.

7. Why Use Closed or Maximal?

• Mining frequent itemsets may result in an exponential number of patterns.

• Closed and maximal reduce computation and storage.

8. Example

Dataset:

• T1: {a1, a2, ..., a100}

• T2: {a1, a2, ..., a50}

• minsup = 1

➤ Frequent itemsets: All subsets of T1 and T2

Total = 2^100 - 1 → Too large!

➤ Closed frequent itemsets:


• {a1, ..., a50} → support: 2

• {a1, ..., a100} → support: 1

➤ Maximal frequent itemset:

• {a1, ..., a100} only (it has no frequent superset)

9. Association Rule Mining Steps

1. Find all frequent itemsets (support ≥ minsup).

2. Generate strong association rules from those itemsets (confidence ≥ minconf).


Apriori Algorithm – Overview


• Purpose: To mine frequent itemsets for Boolean association rules.

• Proposed by: R. Agrawal and R. Srikant (1994).

• Name Origin: Uses prior knowledge of itemset properties.

Working Principle

• Level-wise iterative approach:

o Finds frequent 1-itemsets (L1) from the database.

o Uses Lk-1 to generate Lk (frequent k-itemsets).

o Iterates until no more frequent itemsets can be found.

Apriori Property (Antimonotonicity)

• Definition: All non-empty subsets of a frequent itemset must also be frequent.

• Implication:

o If itemset I is infrequent, then any superset I ∪ A is also infrequent.

o Helps in pruning the candidate space (reducing computations).

Two-Step Process (Join & Prune)


1. Join Step:

o Generate candidate itemsets Ck by self-joining Lk-1.

o Join l1 and l2 in Lk-1 if their first k-2 items are the same.

o Ensures no duplicates using lexicographic order.

2. Prune Step:

o Remove candidate c ∈ Ck if any of its (k-1)-subsets is not in Lk-1.

Example (Using AllElectronics DB)

• Database D: 9 transactions (T100–T900).

• min_sup = 2 (support count).

• Iterations:

o C1 → L1: All 1-itemsets satisfying min_sup.

o C2 → L2: 2-itemsets from L1 × L1; all subsets are frequent → no pruning.

o C3 → L3: Prune itemsets with infrequent subsets using the Apriori


property.

o C4: Generated but pruned entirely due to infrequent subset →


termination.
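The join-and-prune loop can be sketched compactly in Python. This is a simplified version rather than the textbook pseudocode; the nine transactions are an assumption chosen to be consistent with the example above, and min_sup = 2 as stated.

```python
# Simplified Apriori: level-wise generation of frequent itemsets with the
# join and prune steps. Transactions and min_sup are assumptions.
from itertools import combinations

transactions = [{"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
                {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
                {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"}]
min_sup = 2  # absolute support count

def support_count(itemset):
    return sum(itemset <= t for t in transactions)

# L1: frequent 1-itemsets.
items = sorted({i for t in transactions for i in t})
Lk = [frozenset([i]) for i in items if support_count(frozenset([i])) >= min_sup]

frequent = list(Lk)
k = 2
while Lk:
    # Join step: union pairs of (k-1)-itemsets whose union has exactly k items.
    candidates = {a | b for a in Lk for b in Lk if len(a | b) == k}
    # Prune step: drop candidates with an infrequent (k-1)-subset (Apriori property).
    candidates = {c for c in candidates
                  if all(frozenset(s) in set(Lk) for s in combinations(c, k - 1))}
    # Count support and keep the frequent k-itemsets.
    Lk = [c for c in candidates if support_count(c) >= min_sup]
    frequent.extend(Lk)
    k += 1

for itemset in frequent:
    print(sorted(itemset), support_count(itemset))
```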

Generating Association Rules


Definitions

• Frequent Itemset: An itemset whose support ≥ minimum support threshold.

• Association Rule: An implication of the form A → B, where A and B are itemsets.

• Support Count: Number of transactions containing a given itemset.

• Confidence: Measures how often items in B appear in transactions that contain A:

Confidence(A → B) = Support(A ∪ B) / Support(A)
Steps to Generate Association Rules

1. Find all frequent itemsets using algorithms like Apriori or FP-Growth.

2. For each frequent itemset l:

o Generate all non-empty subsets s of l.

o For each s, form the rule:

s → (l − s)

o Compute the confidence of each rule.

3. Filter strong rules:

o Only keep rules with confidence ≥ min_conf.

o All rules automatically satisfy min_support because they are derived


from frequent itemsets.

Example

Let X = {I1, I2, I5} be a frequent itemset.

Non-empty subsets of X:

• {I1}, {I2}, {I5}, {I1, I2}, {I1, I5}, {I2, I5}

Possible rules and confidences (assuming support counts):

• {I1, I2} → {I5} → 50%

• {I1, I5} → {I2} → 100%

• {I2, I5} → {I1} → 100%

• {I1} → {I2, I5} → 33%

• {I2} → {I1, I5} → 29%

• {I5} → {I1, I2} → 100%

With min_conf = 70%, strong rules:

• {I1, I5} → {I2}

• {I2, I5} → {I1}

• {I5} → {I1, I2}
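A sketch of the rule-generation step for a single frequent itemset; the transaction list is an assumption consistent with the support counts quoted above, and min_conf = 0.7 matches the example.

```python
# Generating rules s -> (l - s) from a frequent itemset l and keeping those
# with confidence >= min_conf. Transactions are assumed (as in the Apriori sketch).
from itertools import combinations

transactions = [{"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
                {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
                {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"}]

def support_count(itemset):
    return sum(set(itemset) <= t for t in transactions)

def rules_from(l, min_conf):
    l = frozenset(l)
    for r in range(1, len(l)):                  # every non-empty proper subset s of l
        for s in combinations(l, r):
            s = frozenset(s)
            conf = support_count(l) / support_count(s)
            if conf >= min_conf:
                yield sorted(s), sorted(l - s), conf

for antecedent, consequent, conf in rules_from({"I1", "I2", "I5"}, min_conf=0.7):
    print(antecedent, "->", consequent, f"confidence={conf:.0%}")
```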


FP-Growth Algorithm

Motivation

Apriori algorithm, though effective, suffers from:

• Huge candidate generation (e.g., 10⁴ 1-itemsets → 10⁷ 2-itemsets).

• Multiple full database scans and expensive pattern matching.

FP-Growth Solution

• Avoids candidate generation by using a divide-and-conquer strategy.

• Builds a compressed data structure called the FP-tree (Frequent Pattern


Tree).

• Recursively mines conditional FP-trees for frequent patterns.

How FP-Growth Works

1. First Database Scan

• Count support of all items → generate frequent 1-itemsets.

• Sort items in descending order of support → List L.

2. Build FP-Tree

• Start with a null root.

• For each transaction:

o Sort items according to L.

o Insert path into tree, sharing common prefixes.

o Increment node counts for existing prefixes.

• Maintain node-links for quick access via a header table.
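The two scans described above can be sketched as a simplified FP-tree builder (construction only, no mining); the transaction list is an assumption consistent with the list L shown in the example summary below.

```python
# Simplified FP-tree construction: first scan counts items, second scan inserts
# each transaction (ordered by descending support) sharing common prefixes.
# Transactions are assumed, consistent with L = {I2:7, I1:6, I3:6, I4:2, I5:2}.
from collections import Counter, defaultdict

transactions = [["I1", "I2", "I5"], ["I2", "I4"], ["I2", "I3"],
                ["I1", "I2", "I4"], ["I1", "I3"], ["I2", "I3"],
                ["I1", "I3"], ["I1", "I2", "I3", "I5"], ["I1", "I2", "I3"]]
min_sup = 2

class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count, self.children = 1, {}

# Scan 1: support counts; keep frequent items ordered by descending support (list L).
counts = Counter(i for t in transactions for i in t)
L = [i for i, c in counts.most_common() if c >= min_sup]

root = Node(None, None)
header = defaultdict(list)          # header table: item -> node-links

# Scan 2: insert each transaction, items sorted by L, reusing shared prefixes.
for t in transactions:
    node = root
    for item in sorted((i for i in t if i in L), key=L.index):
        if item in node.children:
            node.children[item].count += 1      # shared prefix: increment count
        else:
            node.children[item] = Node(item, node)
            header[item].append(node.children[item])
        node = node.children[item]

print(L)   # items by descending support, e.g. I2 first (ties in arbitrary order)
print({i: sum(n.count for n in header[i]) for i in L})  # supports recovered via node-links
```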

3. Mine FP-Tree

For each item (starting from the least frequent in L):

• Construct Conditional Pattern Base (CPB):

o Paths in FP-tree ending with the item.


• Build Conditional FP-Tree from CPB.

• Recursively mine the conditional FP-tree.

Example Summary

Frequent Items (sorted by support):

L = {I2:7, I1:6, I3:6, I4:2, I5:2}

Mining I5:

• CPB: {I2, I1:1}, {I2, I1, I3:1}

• Conditional FP-tree: I2:2, I1:2

• Patterns: {I2, I5}, {I1, I5}, {I2, I1, I5}

Mining I4:

• CPB: {I2, I1:1}, {I2:1}

• Conditional FP-tree: I2:2

• Pattern: {I2, I4}

Mining I3:

• CPB: {I2, I1:2}, {I2:2}, {I1:2}

• Conditional FP-tree: Two branches

• Patterns: {I2, I3}, {I1, I3}, {I2, I1, I3}

Mining I1:

• CPB: {I2:4}

• Conditional FP-tree: I2:4

• Pattern: {I2, I1}

FP-Tree Benefits

• Compact representation of the database.

• Reduces:

o Database scans

o Candidate generation
o Search space

Recursive Mining

• Focuses on smaller projected databases.

• Combines prefix path with suffix pattern during recursion.

Why Pattern Evaluation Matters


• Even strong rules (high support & confidence) can be misleading or
uninteresting.

• This problem is especially common with:

o Low support thresholds

o Long patterns

Strong Rules May Be Misleading

• Example: The rule "buys computer games → buys videos" has:

o Support: 40%

o Confidence: 66%

• Seems strong, but actually misleading, because:


o The base probability of buying videos is 75%

o So buying games lowers the chance of buying videos (→ negative


correlation)

• Takeaway: Confidence does not imply true correlation

Correlation-Based Evaluation

To improve pattern evaluation, use correlation analysis.

Lift:

• Formula:
lift(A → B) = P(A ∩ B) / (P(A) * P(B))

• Interpretation:

o Lift > 1: Positive correlation

o Lift < 1: Negative correlation

o Lift = 1: No correlation

• In the example, lift = 0.89 → negative correlation

Chi-Square (χ²) Measure:

• Based on a contingency table of observed vs expected values

• Formula:
χ² = Σ (observed - expected)² / expected

• High χ² → Strong evidence of dependence

• In the example, χ² = 555.6, confirming negative correlation

Conclusion: Which Measures Are Interesting?

• Support & Confidence: Good for filtering, but insufficient

• Lift & Chi-square (χ²): Better for identifying true interestingness

o Capture actual dependencies (positive or negative)


Pattern Evaluation Methods in Association Rule Mining

Overview

• Most association rule mining algorithms use the support–confidence


framework.

• Issue: Even with thresholds, many generated rules may still be uninteresting or
misleading.

• Especially problematic with:

o Low support thresholds

o Long patterns

• To tackle this, additional measures are used to assess the interestingness of


patterns more effectively.

Strong Rules Are Not Necessarily Interesting

Subjective vs Objective Interestingness:

• Subjective: Depends on user preferences, domain knowledge, etc.

• Objective: Based on statistical measures derived from the data.

Objective measures help filter out misleading rules before presenting them to users.

Example 6.7 – Misleading “Strong” Rule

Scenario:

• Data from AllElectronics about purchase behavior:


o Total transactions: 10,000

o Customers who bought:

▪ Computer games: 6000

▪ Videos: 7500

▪ Both: 4000

Discovered Rule:

buys(X, "computer games") → buys(X, "videos")


Support = 40%, Confidence = 66%

Analysis:

• Rule meets minimum support (30%) and confidence (60%) → Considered strong.

• BUT, the actual probability of buying videos is 75%.

o Hence, confidence (66%) < base probability (75%).

o Indicates negative correlation.

• Conclusion: Rule is misleading.

o Buying games actually reduces the likelihood of buying videos.

Key Insight:

• Confidence alone doesn't imply meaningful association.

• Need better metrics to reveal true correlations.

From Association Analysis to Correlation Analysis

Why Use Correlation Measures?

• Support & confidence cannot detect:

o Independence

o Negative correlation

• Correlation measures evaluate statistical dependency between items.

Lift Measure

Formula:
Lift(A → B) = P(A ∩ B) / (P(A) * P(B))

Interpretation:

• Lift > 1: Positive correlation (A implies B more often than by chance)

• Lift = 1: No correlation (independence)

• Lift < 1: Negative correlation (A implies B less than by chance)

Example 6.8 – Applying Lift

• From previous data:

o P(game) = 0.60

o P(video) = 0.75

o P(game ∩ video) = 0.40

Lift = 0.40 / (0.60 × 0.75) = 0.40 / 0.45 = 0.89

• Result: Since lift < 1 → Negative correlation

Lift detects negative correlation that confidence failed to reveal.
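The lift value quoted above follows directly from the three probabilities:

```python
# Reproducing the lift computation for the games/videos example.
p_game, p_video, p_both = 0.60, 0.75, 0.40

lift = p_both / (p_game * p_video)
print(round(lift, 2))   # 0.89 -> below 1, so the items are negatively correlated
```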

Chi-Square (χ²) Measure

Purpose:

• Tests independence between itemsets

• Based on observed vs. expected values in a contingency table

Formula:

χ² = Σ [(Observed − Expected)² / Expected]


Example 6.9 – Applying χ²

Contingency Table: Observed Values (Table 6.6)

|           | Game | ¬Game | Row Total |
|---|---|---|---|
| Video     | 4000 | 3500 | 7500 |
| ¬Video    | 2000 | 500  | 2500 |
| Col Total | 6000 | 4000 | 10000 |

Expected Values (Table 6.7)

|        | Game (Exp) | ¬Game (Exp) |
|---|---|---|
| Video  | 4500 | 3000 |
| ¬Video | 1500 | 1000 |

χ² Calculation:

χ² = (4000−4500)²/4500 + (3500−3000)²/3000 + (2000−1500)²/1500 + (500−1000)²/1000

   = 500²/4500 + 500²/3000 + 500²/1500 + 500²/1000

   = 55.56 + 83.33 + 166.67 + 250.0

   = 555.6

Interpretation:

• A large χ² value indicates a statistically significant deviation from independence

• In this case:

o Observed joint occurrence (4000) < Expected (4500)

o Confirms negative correlation


Like Lift, Chi-square also detects the negative correlation missed by the confidence metric.
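For completeness, a short sketch that reproduces the χ² value from the observed table, deriving the expected counts from the row and column totals (plain NumPy, no statistics library assumed):

```python
# Chi-square from the observed 2x2 contingency table of the example.
import numpy as np

observed = np.array([[4000, 3500],      # rows: video, ¬video
                     [2000,  500]])     # columns: game, ¬game

row = observed.sum(axis=1, keepdims=True)
col = observed.sum(axis=0, keepdims=True)
expected = row * col / observed.sum()   # [[4500, 3000], [1500, 1000]]

chi2 = ((observed - expected) ** 2 / expected).sum()
print(round(chi2, 1))                   # 555.6
```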
