SONA COLLEGE OF TECHNOLOGY
(Autonomous) Salem - 5
U23IT953 - Data Warehousing
and Data Mining
Prepared by,
Kruthika P
Assistant Professor
Department of Information Technology
Sona College of Technology.
Data Warehousing
Unit – 1
Introduction to data warehousing: A Multidimensional
Model, Types of Schema for the Multidimensional Model,
Architecture and Implementation, from Data Warehouse to
Data Mining.
Data cube technology: Methods for data cube computation,
further development of data cube and OLAP technology,
attribute-oriented induction.
Chapter 4: Data Warehousing and On-line
Analytical Processing
Data Warehouse: Basic Concepts
Data Warehouse Modeling: Data Cube and
OLAP
Data Warehouse Design and Usage
Data Warehouse Implementation
Data Generalization by Attribute-Oriented
Induction
Summary
What is a Data Warehouse?
Defined in many different ways, but not rigorously.
A decision support database that is maintained
separately from the organization’s operational database
Supports information processing by providing a solid
platform of consolidated, historical data for analysis.
“A data warehouse is a subject-oriented, integrated, time-
variant, and nonvolatile collection of data in support of
management’s decision-making process.”—W. H. Inmon
Data warehousing:
The process of constructing and using data warehouses
Data Warehouse—Subject-Oriented
Organized around major subjects, such as
customer, product, sales
Focusing on the modeling and analysis of data for
decision makers, not on daily operations or
transaction processing
Provides a simple and concise view around
particular subject issues by excluding data that
are not useful in the decision support process
Data Warehouse—Integrated
Constructed by integrating multiple,
heterogeneous data sources
relational databases, flat files, on-line
transaction records
Data cleaning and data integration techniques
are applied.
Ensure consistency in naming conventions,
encoding structures, attribute measures, etc.
among different data sources
E.g., Hotel price: currency, tax, breakfast covered,
etc.
When data is moved to the warehouse, it is
converted.
Data Warehouse—Time Variant
The time horizon for the data warehouse is
significantly longer than that of operational systems
Operational database: current value data
Data warehouse data: provide information from a
historical perspective (e.g., past 5-10 years)
Every key structure in the data warehouse
Contains an element of time, explicitly or
implicitly
But the key of operational data may or may not
contain “time element”
Data Warehouse—Nonvolatile
A physically separate store of data transformed
from the operational environment
Operational update of data does not occur in the
data warehouse environment
Does not require transaction processing,
recovery, and concurrency control mechanisms
Requires only two operations in data
accessing:
initial loading of data and access of data
Why a Separate Data Warehouse?
High performance for both systems
DBMS— tuned for OLTP: access methods, indexing,
concurrency control, recovery
Warehouse—tuned for OLAP: complex OLAP queries,
multidimensional view, consolidation
Different functions and different data:
missing data: Decision support requires historical data which
operational DBs do not typically maintain
data consolidation: DS requires consolidation (aggregation,
summarization) of data from heterogeneous sources
data quality: different sources typically use inconsistent data
representations, codes and formats which have to be
reconciled
Note: There are more and more systems which perform OLAP
analysis directly on relational databases
Difference between operational DB
and data warehouse
Data Mining: Concepts and Techniques (August 18, 2025)
Data Warehouse: A Multi-Tiered Architecture
[Figure: multi-tiered architecture — Data Sources (operational DBs, other sources) feed Extract/Transform/Load and Refresh into the Data Storage tier (data warehouse, data marts, metadata repository, with monitor and integrator components); an OLAP server forms the OLAP Engine tier; Front-End Tools provide query/reports, analysis, and data mining interfaces.]
Data Warehouse: A Three-Tiered Architecture
Data Warehouse Models
Enterprise warehouse
collects all of the information about subjects spanning
the entire organization
Data Mart
a subset of corporate-wide data that is of value to a
specific group of users. Its scope is confined to
specific, selected groups, such as a marketing data mart
Independent vs. dependent (directly from warehouse) data
mart
Virtual warehouse
A set of views over operational databases
Only some of the possible summary views may be
materialized
A recommended approach for data
warehouse development.
Back end Tools
Extraction, Transformation, and Loading (ETL)
Data extraction
get data from multiple, heterogeneous, and external
sources
Data cleaning
detect errors in the data and rectify them when possible
Data transformation
convert data from legacy or host format to warehouse
format
Load
sort, summarize, consolidate, compute views, check
integrity, and build indices and partitions
Refresh
propagate the updates from the data sources to the
warehouse
Metadata Repository
Meta data is the data defining warehouse objects. It stores:
Description of the structure of the data warehouse
schema, view, dimensions, hierarchies, derived data defn, data
mart locations and contents
Operational meta-data
data lineage (history of migrated data and transformation path),
currency of data (active, archived, or purged), monitoring
information (warehouse usage statistics, error reports, audit
trails)
The algorithms used for summarization
The mapping from operational environment to the data warehouse
Data related to system performance
warehouse schema, view and derived data definitions
Business data
business terms and definitions, ownership of data, charging
policies
Chapter 4: Data Warehousing and On-line
Analytical Processing
Data Warehouse: Basic Concepts
Data Warehouse Modeling: Data Cube and
OLAP
Data Warehouse Design and Usage
Data Warehouse Implementation
Data Generalization by Attribute-Oriented
Induction
Summary
Multidimensional Model
A data warehouse and OLAP tools are based on a
multidimensional data model, which views data in the form of a
data cube.
A data cube allows data to be modeled and viewed in multiple
dimensions.
It is defined by dimensions and facts.
Dimensions: are the perspectives or entities with respect to
which an organization wants to keep records.
Dimension table : Each dimension may have a table associated
with it, which further describes the dimension.
Facts: numeric measures
Fact Table: A multidimensional data model is typically
organized around a central theme, such as sales, which is
represented by a fact table.
It contains the names of the facts, or measures, as well as keys to
each of the related dimension tables.
Examples of Facts:
Sales amount
Quantity sold
Profit
Revenue
Order count
Characteristics:
Numeric/measurable
Typically aggregatable (SUM, AVG, COUNT, etc.)
Linked to dimensions by foreign keys
Can be additive, semi-additive, or non-additive:
Additive: Can be summed across all dimensions (e.g., sales
amount)
Semi-additive: Can be summed across some dimensions (e.g.,
account balances over time)
Non-additive: Cannot be summed (e.g., percentages, ratios)
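The distinction between additive and non-additive facts can be shown in a small sketch (the profit/revenue figures are hypothetical): an additive measure may be summed across groups, while a ratio must be recomputed from its additive components.

```python
# Hypothetical per-region figures: profit and revenue are additive facts,
# profit margin (a ratio) is non-additive.
regions = {
    "East": {"profit": 30.0, "revenue": 100.0},
    "West": {"profit": 10.0, "revenue": 100.0},
}

# Additive: total profit/revenue is just the sum over regions.
total_profit = sum(r["profit"] for r in regions.values())    # 40.0
total_revenue = sum(r["revenue"] for r in regions.values())  # 200.0

# Non-additive: summing per-region margins gives the wrong answer.
wrong_margin = sum(r["profit"] / r["revenue"] for r in regions.values())  # 0.4
# Correct: recompute the ratio from the additive components.
right_margin = total_profit / total_revenue                               # 0.2

print(total_profit, wrong_margin, right_margin)
```

This is why warehouses store the additive components (profit, revenue) in the fact table and derive ratios at query time.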
Dimensions Table
Dimensions provide the context for the facts — they are
descriptive attributes that help answer the "who, what, when,
where, why" of a business process.
Examples of Dimensions:
Customer (name, location, age group)
Product (name, category, brand)
Time (date, month, year)
Location (city, region, country)
Salesperson (name, department)
Dimension tables can be specified by users or experts, or
automatically generated and adjusted based on data distributions.
Characteristics:
Contain textual or categorical data
Used for filtering, grouping, and
labeling facts
Typically stored in dimension tables
Example for fact and dimension
tables
Summary of Facts and Dimensions

Term            | Description
Dimension       | A perspective or category used to analyze data. Example: Time, Item, Location, Supplier.
Fact            | A measurable value or metric. Example: Sales Amount, Units Sold.
Fact Table      | Central table containing the measurable facts plus foreign keys to the dimension tables.
Dimension Table | Provides context (attributes) for each dimension.
Data Warehouse for AllElectronics
2D Representation
In this 2-D representation, the sales for Vancouver are shown with respect to
the time dimension (organized in quarters) and the item dimension (organized
according to the types of items sold). The fact or
measure displayed is dollars_sold (in thousands).
3D Representation
4D Representation
Lattice of Cuboids
Data Cube as Lattice of Cuboids
Each subset of dimensions gives a new view of the data:
0-D (Apex Cuboid): Total sales (no dimension breakdown)
1-D Cuboids: e.g., total sales by time
2-D Cuboids: e.g., sales by item and time
3-D Cuboids: e.g., sales by item, time, location
4-D Cuboid (Base): Most detailed data (time + item +
location + supplier)
Base Cuboid (Most Detailed
Data)
The base cuboid contains the lowest level of summarization.
It shows the raw, detailed data — for every possible
combination of all dimensions.
Example (4 Dimensions):
Time = Q1, Q2, Q3, Q4
Item = Computer, Phone, etc.
Location = Chicago, New York, etc.
Supplier = SUP1, SUP2, SUP3
Base Cuboid will include sales for:
Q1, Computer, Chicago, SUP1
Q1, Phone, New York, SUP3
Q2, Entertainment, Vancouver, SUP2
…and so on — every possible combination!
Think of it like the bottom row of your cube — the most granular,
row-by-row data.
Apex Cuboid (Most
Summarized Data)
The apex cuboid contains the highest level
of summarization.
It is the total or grand summary of the
data, aggregated across all dimensions.
We summarize over all times, items,
locations, and suppliers.
Apex Cuboid just shows:
Total Sales = ₹50,000 (e.g.)
Think of it like the top of a pyramid — just
one number, fully summarized
In data warehousing literature, an n-D base
cube is called a base cuboid.
The top most 0-D cuboid, which holds the
highest-level of summarization, is called
the apex cuboid.
The lattice of cuboids forms a data cube.
Cube: A Lattice of Cuboids
0-D (apex) cuboid: (all)
1-D cuboids: (time), (item), (location), (supplier)
2-D cuboids: (time, item), (time, location), (time, supplier), (item, location), (item, supplier), (location, supplier)
3-D cuboids: (time, item, location), (time, item, supplier), (time, location, supplier), (item, location, supplier)
4-D (base) cuboid: (time, item, location, supplier)
Dense cube vs Sparse cube
Sparse Cube:
Definition: A sparse cube is one where
most combinations of dimensions do not
have associated data (i.e., they're empty).
Dense Cube:
Definition: A dense cube is one where
almost every possible combination of
dimension values has a corresponding data
value.
Dense cube vs Sparse cube
If you have:
|A| = 100 products
|B| = 50 stores
|C| = 12 months
Total possible combinations (in a dense cube):
100 × 50 × 12 = 60,000
But maybe only 5,000 combinations actually exist
(because many products aren’t sold in some stores
or months).
So:
Dense cube would store 60,000 cells.
Sparse cube would store only 5,000 cells.
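A sparse cube can be sketched as a dictionary keyed by dimension-value tuples, storing only the non-empty cells (the products, stores, and values below are hypothetical):

```python
# Sparse cube: only non-empty (product, store, month) cells are stored,
# instead of the full |A| x |B| x |C| dense array.
sparse_cube = {
    ("P1", "S1", "Jan"): 500,
    ("P1", "S2", "Feb"): 300,
    ("P2", "S1", "Jan"): 200,
}

# Only 3 cells occupy storage.
print(len(sparse_cube))  # 3

# Missing combinations simply default to 0 at query time.
print(sparse_cube.get(("P2", "S2", "Mar"), 0))  # 0
```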
Problems
Suppose a base cuboid has three dimensions:
A, B, and C, with the following number of
cells:
|A| = 80
|B| = 15
|C| = 25
Each cube cell stores one measure of 8 bytes.
What is the total size of the computed
cube assuming the cube is dense?
Step 1
Number of cuboids
Since there are 3 dimensions, the total number of cuboids is
2^3 = 8.
Now calculate the number of cells in each cuboid:
(A, B, C) = 80 × 15 × 25 = 30,000
(A, B) = 80 × 15 = 1,200
(A, C) = 80 × 25 = 2,000
(B, C) = 15 × 25 = 375
(A) = 80
(B) = 15
(C) = 25
() = 1 (the apex cuboid has a single cell)
Total number of cells = 30,000 + 1,200 + 2,000 + 375 + 80 +
15 + 25 + 1 = 33,696
Step 2
Multiply by bytes per cell
Each cell uses 8 bytes.
Total size = 33,696 × 8 = 269,568 bytes
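The two steps above can be automated: enumerate every subset of the dimensions, multiply the cardinalities in each subset to get that cuboid's cell count, sum over all 2^n cuboids, and multiply by the bytes per cell.

```python
from itertools import combinations

def dense_cube_size(dims, bytes_per_cell):
    """Total cells over all 2^n cuboids of a dense cube, and total bytes.
    A cuboid over a subset S of dimensions has prod(|d| for d in S) cells;
    the empty subset is the apex cuboid with 1 cell."""
    names = list(dims)
    total_cells = 0
    for k in range(len(names) + 1):
        for subset in combinations(names, k):
            cells = 1
            for d in subset:
                cells *= dims[d]
            total_cells += cells
    return total_cells, total_cells * bytes_per_cell

cells, size = dense_cube_size({"A": 80, "B": 15, "C": 25}, 8)
print(cells, size)  # 33696 269568
```

Equivalently, the cell total is the closed form (|A|+1) × (|B|+1) × (|C|+1) = 81 × 16 × 26 = 33,696, and the same function answers the "Let's Try" exercise below.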
Let's Try
Suppose a base cuboid has three dimensions: A, B, and C, with
the following number of cells:
|A| = 60
|B| = 12
|C| = 30
Each cube cell stores one measure of 2 bytes.
What is the total size of the computed cube assuming the cube
is dense?
For Sparse Cube
No need to calculate the total number of
possible cells, because only non-empty
cells are stored.
So, if given the number of non-empty cells
is 10.
Then the total size is 10×8=80 bytes.
By assuming sparsity is around 1% to 10% of the total number of cells:

Sparsity (%)  | Non-empty Cells | Estimated Size (bytes)
100% (Dense)  | 33,696          | 33,696 × 8 = 269,568
10%           | 3,370           | 3,370 × 8 = 26,960
5%            | 1,684           | 1,684 × 8 = 13,472
1%            | 337             | 337 × 8 = 2,696
Exact case    | 10              | 10 × 8 = 80
Schema for multi-dimensional
model
The entity-relationship data model is appropriate
for online transaction processing.
A data warehouse requires a concise, subject-
oriented schema that facilitates online data
analysis.
A popular data model for a data warehouse is a
multidimensional model, which can exist in the
form of
a star schema,
a snowflake schema,
a fact constellation schema.
Star schema
a large central table (fact table)
containing the bulk of the data, with no
redundancy.
a set of smaller attendant tables
(dimension tables), one for each
dimension but may contain redundancy.
A fact table in the middle connected to a
set of dimension tables.
Example of Star Schema

Sales fact table: time_key, item_key, branch_key, location_key (foreign keys); units_sold, dollars_sold, avg_sales (measures)
Dimension tables:
- time (time_key, day, day_of_the_week, month, quarter, year)
- item (item_key, item_name, brand, type, supplier_type)
- branch (branch_key, branch_name, branch_type)
- location (location_key, street, city, state_or_province, country)
Snowflake schema
A refinement of star schema where some
dimensional hierarchy is normalized into
a set of smaller dimension tables,
forming a shape similar to snowflake.
The snowflake structure can reduce the
effectiveness of browsing, since more
joins will be needed to execute a query.
Example of Snowflake Schema

Sales fact table: time_key, item_key, branch_key, location_key (foreign keys); units_sold, dollars_sold, avg_sales (measures)
Dimension tables (partially normalized):
- time (time_key, day, day_of_the_week, month, quarter, year)
- item (item_key, item_name, brand, type, supplier_key) → supplier (supplier_key, supplier_type)
- branch (branch_key, branch_name, branch_type)
- location (location_key, street, city_key) → city (city_key, city, state_or_province, country)
Fact constellations
Multiple fact tables share dimension tables,
viewed as a collection of stars, therefore
called galaxy schema or fact constellation.
Sophisticated applications may require
multiple fact tables to share dimension tables.
Example of Fact Constellation

Sales fact table: time_key, item_key, branch_key, location_key (foreign keys); units_sold, dollars_sold, avg_sales (measures)
Shipping fact table: time_key, item_key, shipper_key, from_location, to_location (foreign keys); dollars_cost, units_shipped (measures)
Shared dimension tables:
- time (time_key, day, day_of_the_week, month, quarter, year)
- item (item_key, item_name, brand, type, supplier_type)
- branch (branch_key, branch_name, branch_type)
- location (location_key, street, city, province_or_state, country)
- shipper (shipper_key, shipper_name, location_key, shipper_type)
Examples for Defining Star, Snowflake,
and Fact Constellation Schemas
To define a multidimensional schema using a
SQL-based data mining query language like
DMQL (Data Mining Query Language):
DMQL provides two key language primitives:
define cube — to define the data cube (the main
structure that combines dimensions and
measures).
define dimension — to define each dimension
(the structure of categorical data).
Schema for Data Warehouse and
Data Mart
A data warehouse collects information about subjects that
span the entire organization, such as customers, items,
sales, assets, and personnel, and thus its scope is enterprise-
wide. For data warehouses, the fact constellation schema
is commonly used since it can model multiple, interrelated
subjects.
A data mart, on the other hand, is a department subset of
the data warehouse that focuses on selected subjects, and
thus its scope is department-wide. For data marts, the star
or snowflake schema is commonly used, since both are
geared toward modeling single subjects.
A Concept Hierarchy: Dimension (location)

all:     all
region:  Europe, ..., North_America
country: Germany, ..., Spain; Canada, ..., Mexico
city:    Frankfurt, ...; Vancouver, ..., Toronto
office:  L. Chan, ..., M. Wind
Concept Hierarchies and lattice
structures
Data Cube Measures
A data cube measure is a numeric function
that can be evaluated at each point in the
data cube space.
Data cube space can be defined by a set of
dimension–value pairs; for example, (time
= “Q1”, location =“Vancouver”, item=
“computer”).
Three Categories of Measures
Distributive: if the result derived by applying the
function to n aggregate values is the same as that
derived by applying the function on all the data without
partitioning
E.g., count(), sum(), min(), max()
Algebraic: if it can be computed by an algebraic
function with M arguments (where M is a bounded
integer), each of which is obtained by applying a
distributive aggregate function
E.g., avg(), min_N(), standard_deviation()
Holistic: if there is no constant bound on the storage
size needed to describe a subaggregate.
E.g., median(), mode(), rank()
Examples
Dataset: [3, 7, 2, 9]
Split into two groups: [3, 7] and [2, 9]
sum([3, 7]) = 10, sum([2, 9]) = 11, total
sum = 10 + 11 = 21
sum([3, 7, 2, 9]) = 21 → same result
So, sum() is distributive.
Group 1: [3, 7] → sum = 10, count = 2
Group 2: [2, 9] → sum = 11, count = 2
Total avg = (10 + 11) / (2 + 2) = 21 / 4 =
5.25
So, avg() is algebraic – it uses sum and
count, which are distributive.
Examples
Dataset: [3, 7, 2, 9]
Split into: [3, 7] and [2, 9]
You can’t find the median of the full
dataset just by knowing the medians of
each group.
Median of [3, 7] = 5
Median of [2, 9] = 5.5
But overall median of [2, 3, 7, 9] = (3 +
7)/2 = 5 → doesn’t match
❌ So, median() is not distributive or
algebraic → it's holistic
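The three slide examples above can be checked directly in code on the same dataset [3, 7, 2, 9]:

```python
import statistics

data = [3, 7, 2, 9]
g1, g2 = [3, 7], [2, 9]  # the two partitions

# sum() is distributive: combining partial sums gives the global sum.
assert sum(g1) + sum(g2) == sum(data)  # 10 + 11 == 21

# avg() is algebraic: computable from two distributive values
# (sum and count) of each partition.
avg = (sum(g1) + sum(g2)) / (len(g1) + len(g2))  # 21 / 4 = 5.25

# median() is holistic: the partition medians alone cannot
# reconstruct the global median.
print(statistics.median(g1), statistics.median(g2), statistics.median(data))
# partition medians: 5 and 5.5, but global median of [2, 3, 7, 9] is 5.0
```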
Typical OLAP Operations
Roll up (drill-up): summarize data
by climbing up hierarchy or by dimension reduction
Drill down (roll down): reverse of roll-up
from higher level summary to lower level summary or
detailed data, or introducing new dimensions
Slice and dice: project and select
Pivot (rotate):
reorient the cube, visualization, 3D to series of 2D
planes
Other operations
drill across: involving (across) more than one fact table
drill through: through the bottom level of the cube to
its back-end relational tables (using SQL)
Comparison Chart

Operation     | What it does                          | Example From Our Table
Roll-up       | Aggregates up (e.g., City → Country)  | Group sales by Country and Month
Drill-down    | Gets more detailed                    | Go from Country to City
Slice         | Fix one dimension                     | Filter for Product = iPhone
Dice          | Filter by multiple dimensions         | City = Chicago, Product IN (iPhone, Galaxy)
Pivot         | Rotate cube layout                    | Turn rows → columns for Product by City
Drill-across  | Join multiple fact tables             | Combine sales with inventory
Drill-through | Go to raw transactions                | View actual sales records for one summary cell
OLAP operations with Example
Date City Product Category Sales
2025-07-01 Chicago iPhone Mobile 1000
2025-07-01 Chicago Galaxy Mobile 900
2025-07-01 New York iPhone Mobile 1100
2025-07-01 New York LaptopX Laptop 1200
2025-07-02 Chicago iPhone Mobile 1000
2025-07-02 New York Galaxy Mobile 950
2025-07-02 Chicago LaptopX Laptop 1300
🔼 1. Roll-up
Goal: Go from detailed → summarized (e.g.,
City → State, Date → Month)
Assume:
Chicago & New York → USA
July 2025 is the only month shown
Query/Transformation:
Group by Country and Month:
Country Month Sales
USA July 7450
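This roll-up can be sketched in SQL over the sample table, assuming the City → Country mapping (both cities in the USA) stated above:

```python
import sqlite3

# In-memory copy of the sample sales table from the previous slide.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (day TEXT, city TEXT, product TEXT, amount INT)")
conn.executemany("INSERT INTO sales VALUES (?,?,?,?)", [
    ("2025-07-01", "Chicago",  "iPhone",  1000),
    ("2025-07-01", "Chicago",  "Galaxy",   900),
    ("2025-07-01", "New York", "iPhone",  1100),
    ("2025-07-01", "New York", "LaptopX", 1200),
    ("2025-07-02", "Chicago",  "iPhone",  1000),
    ("2025-07-02", "New York", "Galaxy",   950),
    ("2025-07-02", "Chicago",  "LaptopX", 1300),
])

# Roll-up: climb City -> Country and Date -> Month, then aggregate.
result = conn.execute("""
    SELECT 'USA' AS country, strftime('%m', day) AS month, SUM(amount)
    FROM sales GROUP BY country, month
""").fetchone()
print(result)  # ('USA', '07', 7450)
```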
🔽 2. Drill-down
Goal: Go from summarized → detailed
Country Month Sales
USA July 7450
Drill down to City and Date:
Date City Sales
2025-07-01 Chicago 1900
2025-07-01 New York 2300
2025-07-02 Chicago 2300
2025-07-02 New York 950
🔪 3. Slice
Goal: Fix one dimension (select specific
value)
Example: Slice on Product = 'iPhone'
You removed all non-iPhone rows.
Date City Product Sales
2025-07-01 Chicago iPhone 1000
2025-07-01 New York iPhone 1100
2025-07-02 Chicago iPhone 1000
🎲 4. Dice
Goal: Filter on multiple dimensions
Example:
Product IN ('iPhone', 'Galaxy')
City IN ('Chicago')
Date = 2025-07-01
Result:
Date City Product Sales
2025-07-01 Chicago iPhone 1000
2025-07-01 Chicago Galaxy 900
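Slice and dice are both just selections over the cube; a minimal sketch on a subset of the sample rows:

```python
# Subset of the sample sales rows, as a list of dicts.
rows = [
    {"date": "2025-07-01", "city": "Chicago",  "product": "iPhone", "sales": 1000},
    {"date": "2025-07-01", "city": "Chicago",  "product": "Galaxy", "sales": 900},
    {"date": "2025-07-01", "city": "New York", "product": "iPhone", "sales": 1100},
    {"date": "2025-07-02", "city": "Chicago",  "product": "iPhone", "sales": 1000},
]

# Slice: fix a single dimension (Product = iPhone).
slice_ = [r for r in rows if r["product"] == "iPhone"]

# Dice: restrict several dimensions at once.
dice = [r for r in rows
        if r["product"] in ("iPhone", "Galaxy")
        and r["city"] == "Chicago"
        and r["date"] == "2025-07-01"]

print(len(slice_), len(dice))  # 3 2
```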
🔄 5. Pivot (Rotate)
Goal: Change data layout for visualization
Before (standard table):
City Product Sales
Chicago iPhone 1000
Chicago Galaxy 900
New York iPhone 1100
After pivoting (Rows: City, Columns: Product, Values: Sales):
City     | iPhone | Galaxy
Chicago  | 1000   | 900
New York | 1100   | -
🔁 6. Drill-across
Goal: Compare multiple fact tables side-
by-side
Let’s say you have another table:
inventory_data
Date City Product Inventory
2025-07-01 Chicago iPhone 30
2025-07-01 Chicago Galaxy 25
Drill-across: join sales_data and inventory_data:
Date       | City    | Product | Sales | Inventory
2025-07-01 | Chicago | iPhone  | 1000  | 30
2025-07-01 | Chicago | Galaxy  | 900   | 25
7. Drill-through
Goal: From cube (aggregated summary) to
raw, transactional records
Suppose your summary cube says:
City Product Sales
Chicago iPhone 2000
You drill through to raw transaction
records:
TransactionID | CustomerID | Date       | City    | Product | Sales
TX001         | C123       | 2025-07-01 | Chicago | iPhone  | 1000
TX002         | C456       | 2025-07-02 | Chicago | iPhone  | 1000
This shows the real data entries that make up the
cube cell.
Chapter 4: Data Warehousing and On-line
Analytical Processing
Data Warehouse: Basic Concepts
Data Warehouse Modeling: Data Cube and
OLAP
Data warehouse to data mining
Data Generalization by Attribute-Oriented
Induction
Summary
Data Warehouse Applications
Information processing supports querying, basic statistical analysis,
and reporting using crosstabs, tables, charts, or graphs. A current trend
in data warehouse information processing is to construct low-cost Web-
based accessing tools that are then integrated with Web browsers.
Analytical processing supports basic OLAP operations, including slice-
and-dice, drill-down, roll-up, and pivoting. It generally operates on
historical data in both summarized and detailed forms. The major
strength of on-line analytical processing over information processing is
the multidimensional data analysis of data warehouse data.
Data mining supports knowledge discovery by finding hidden patterns
and associations, constructing analytical models, performing
classification and prediction, and presenting the mining results using
visualization tools.
Information processing is not data mining
Information processing, based on queries, can find
useful information.
However, answers to such queries reflect the
information directly stored in databases or
computable by aggregate functions.
They do not reflect sophisticated patterns or
regularities buried in the database.
OLAP and DM
OLAP is a data summarization/aggregation tool that helps
simplify data analysis.
Data mining allows the automated discovery of implicit
patterns and interesting knowledge hidden in large amounts of
data.
OLAP tools are targeted toward simplifying and supporting
interactive data analysis.
The goal of data mining tools is to automate as much of the
process as possible, while still allowing users to guide the process.
Thus, data mining goes one step beyond traditional on-line
analytical processing.
Data Mining
Data mining covers both data description and data
modeling. It performs not only data summary and
comparison but also association, classification, prediction,
clustering, time-series analysis, and other data analysis
tasks.
It may analyze data existing at more detailed granularities
than the summarized data provided in a data warehouse.
It may also analyze transactional, spatial, textual, and
multimedia data that are difficult to model with current
multidimensional database technology.
On-Line Analytical Mining (OLAM)
On-line analytical mining (OLAM) (also called OLAP
mining) integrates on-line analytical processing
(OLAP) with data mining and mining knowledge in
multidimensional databases.
Importance:
High quality of data in data warehouses.
Available information processing
infrastructure surrounding data
warehouses.
OLAP-based exploratory data analysis.
On-line selection of data mining functions.
Design of Data Warehouse: A
Business Analysis Framework
Four views regarding the design of a data
warehouse
Top-down view
allows selection of the relevant information necessary
for the data warehouse
Data source view
exposes the information being captured, stored, and
managed by operational systems
Data warehouse view
consists of fact tables and dimension tables
Business query view
sees the perspectives of data in the warehouse from
the view of end-user
Data Warehouse Design
Process
Top-down, bottom-up approaches or a combination of both
Top-down: Starts with overall design and planning (mature)
Bottom-up: Starts with experiments and prototypes (rapid)
From software engineering point of view
Waterfall: structured and systematic analysis at each step
before proceeding to the next
Spiral: rapid generation of increasingly functional systems,
with short turnaround time
Typical data warehouse design process
Choose a business process to model, e.g., orders, invoices, etc.
Choose the grain (atomic level of data) of the business process
Choose the dimensions that will apply to each fact table record
Choose the measure that will populate each fact table record
Data Warehouse Development: A Recommended Approach

Define a high-level corporate data model first; then, with iterative model refinement, develop individual data marts in parallel with the enterprise data warehouse, and eventually integrate distributed data marts into a multi-tier data warehouse.
Data Warehouse Usage
Three kinds of data warehouse applications
Information processing
supports querying, basic statistical analysis, and
reporting using crosstabs, tables, charts and graphs
Analytical processing
multidimensional analysis of data warehouse data
supports basic OLAP operations, slice-dice, drilling,
pivoting
Data mining
knowledge discovery from hidden patterns
supports associations, constructing analytical models,
performing classification and prediction, and presenting
the mining results using visualization tools
From On-Line Analytical Processing (OLAP)
to On-Line Analytical Mining (OLAM)
Why online analytical mining?
High quality of data in data warehouses
DW contains integrated, consistent, cleaned
data
Available information processing structure
surrounding data warehouses
ODBC, OLEDB, Web accessing, service facilities,
reporting and OLAP tools
OLAP-based exploratory data analysis
Mining with drilling, dicing, pivoting, etc.
On-line selection of data mining functions
Integration and swapping of multiple mining
functions, algorithms, and tasks
Chapter 4: Data Warehousing and On-line
Analytical Processing
Data Warehouse: Basic Concepts
Data Warehouse Modeling: Data Cube and
OLAP
Data Warehouse Design and Usage
Data Warehouse Implementation
Data Generalization by Attribute-Oriented
Induction
Summary
The “Compute Cube” Operator
Cube definition and computation in DMQL
define cube sales [item, city, year]: sum (sales_in_dollars)
compute cube sales
Transform it into a SQL-like language (with a new operator
cube by, introduced by Gray et al.’96)
SELECT item, city, year, SUM (amount)
FROM SALES
CUBE BY item, city, year
This requires computing the following group-bys (one per cuboid
in the lattice):
(item, city, year), (item, city), (item, year), (city, year),
(item), (city), (year), ()
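Since many relational engines lack a CUBE BY operator, it can be emulated by running one GROUP BY per subset of the grouping attributes. A minimal sketch with hypothetical sales rows:

```python
from itertools import combinations
from collections import defaultdict

# Hypothetical rows; CUBE BY item, city, year = 2^3 = 8 group-bys.
rows = [
    {"item": "TV",    "city": "Chicago",  "year": 2024, "amount": 100},
    {"item": "TV",    "city": "New York", "year": 2024, "amount": 200},
    {"item": "Phone", "city": "Chicago",  "year": 2025, "amount": 300},
]
dims = ("item", "city", "year")

cube = {}
for k in range(len(dims) + 1):
    for subset in combinations(dims, k):
        groups = defaultdict(int)
        for r in rows:
            key = tuple(r[d] for d in subset)
            groups[key] += r["amount"]  # the SUM(amount) measure
        cube[subset] = dict(groups)

print(len(cube))        # 8 group-bys (cuboids)
print(cube[()][()])     # apex cell: grand total = 600
print(cube[("city",)])  # {('Chicago',): 400, ('New York',): 200}
```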
Efficient Data Cube
Computation
Data cube can be viewed as a lattice of cuboids
The bottom-most cuboid is the base cuboid
The top-most cuboid (apex) contains only one cell
How many cuboids are there in an n-dimensional cube where
dimension i has L_i levels?

    T = ∏_{i=1}^{n} (L_i + 1)

(The +1 accounts for the virtual top level “all” of each dimension.)
Example with 2 dimensions:
Time has 2 levels: Month and Year → So L₁ = 2
Product has 1 level: Category → So L₂ = 1
Now apply the formula:
Total Cuboids = (L₁ + 1) × (L₂ + 1) = (2 + 1) × (1 + 1) = 3 × 2 = 6
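The formula is a one-liner in code, which also makes the explosion on the next slide easy to check:

```python
from math import prod

def total_cuboids(levels):
    """T = prod(L_i + 1) over all dimensions, where L_i is the number
    of hierarchy levels of dimension i (the +1 accounts for 'all')."""
    return prod(L + 1 for L in levels)

print(total_cuboids([2, 1]))    # 6   (Time: Month/Year; Product: Category)
print(total_cuboids([5] * 10))  # 60466176 — 10 dims x 5 levels each
```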
Curse of Dimensionality
If you have many dimensions, the number
of cuboids grows very fast:
For n dimensions, 2ⁿ cuboids
Example: If 10 dimensions, each with 5 levels
→ (5+1)^10=6¹⁰ = over 60 million cuboids!
This causes huge storage and processing
issues, which is called the curse of
dimensionality.
Solution: Materialization
We don’t compute/store all cuboids. Instead,
we use Materialization strategies:
Types of Materialization of data
cube
Full Materialization-Precompute and store all cuboids
Advantage: Very fast query response — no need to
calculate anything at query time
Disadvantage: Needs a lot of storage and time
No Materialization: Don’t store any cuboids — compute
everything on the fly
Advantage: Saves space
Disadvantage: Slow query response
Partial Materialization: (Best in practice)
Precompute and store only the most useful cuboids
Use when: You want a good balance between speed
and storage
Indexing OLAP Data: Bitmap
Index
Bitmap indexing represents attribute values using bit vectors
(bitmaps) to allow fast querying and aggregation, especially in low-
cardinality domains (i.e., when an attribute has few distinct values).
Index on a particular column: each distinct value in the column
gets its own bit vector, and bit operations are fast.
The length of each bit vector equals the number of records in the
base table.
Not suitable for high-cardinality domains.
Base table:
Cust | Region  | Type
C1   | Asia    | Retail
C2   | Europe  | Dealer
C3   | Asia    | Dealer
C4   | America | Retail
C5   | Europe  | Dealer

Index on Region:
RecID | Asia | Europe | America
1     | 1    | 0      | 0
2     | 0    | 1      | 0
3     | 1    | 0      | 0
4     | 0    | 0      | 1
5     | 0    | 1      | 0

Index on Type:
RecID | Retail | Dealer
1     | 1      | 0
2     | 0      | 1
3     | 0      | 1
4     | 1      | 0
5     | 0      | 1
Base table:              Index on Region:
Cust  Region   Type      RecID  Asia  Europe  America
C1    Asia     Retail    1      1     0       0
C2    Europe   Dealer    2      0     1       0
C3    Asia     Dealer    3      1     0       0
C4    America  Retail    4      0     0       1
C5    Europe   Dealer    5      0     1       0

Region   Bitmap
Asia     10100
Europe   01001
America  00010
88
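The bitmap mechanics on these slides can be sketched in Python using integers as bit vectors; names such as `build_bitmap_index` are illustrative, not from the text:

```python
def build_bitmap_index(rows, attr):
    """One bit vector per distinct value of attr; bit i is 1 iff record i
    has that value. Vector length = number of records in the base table."""
    index = {}
    for i, row in enumerate(rows):
        index[row[attr]] = index.get(row[attr], 0) | (1 << i)
    return index

rows = [
    {"cust": "C1", "region": "Asia",    "type": "Retail"},
    {"cust": "C2", "region": "Europe",  "type": "Dealer"},
    {"cust": "C3", "region": "Asia",    "type": "Dealer"},
    {"cust": "C4", "region": "America", "type": "Retail"},
    {"cust": "C5", "region": "Europe",  "type": "Dealer"},
]
region = build_bitmap_index(rows, "region")
rtype = build_bitmap_index(rows, "type")

# "Asian dealers" is a single bitwise AND over two precomputed vectors.
hits = region["Asia"] & rtype["Dealer"]
print([rows[i]["cust"] for i in range(len(rows)) if hits >> i & 1])  # ['C3']
```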
Indexing OLAP Data: Join Indices
Join indexing: fast joins between tables.
Use case: we want to precompute the join between a fact table and a dimension table for faster access.

Fact table (Orders):           Dimension table (Customers):
OrderID  CustomerID  Amount    CustomerID  Name
5001     101         $250      101         Alice
5002     102         $400      102         Bob
5003     101         $180

Join index:
CustomerID  OrderIDs
101         5001, 5003
102         5002
89
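The join index above is just a precomputed mapping from each dimension key to the fact rows that join with it. A minimal sketch (`build_join_index` is an assumed helper name):

```python
from collections import defaultdict

def build_join_index(fact_rows, dim_key, fact_key="OrderID"):
    """Precompute the fact-dimension join: dimension key -> list of
    joining fact-table row ids, so query time needs no join scan."""
    idx = defaultdict(list)
    for row in fact_rows:
        idx[row[dim_key]].append(row[fact_key])
    return idx

orders = [
    {"OrderID": 5001, "CustomerID": 101, "Amount": 250},
    {"OrderID": 5002, "CustomerID": 102, "Amount": 400},
    {"OrderID": 5003, "CustomerID": 101, "Amount": 180},
]
join_idx = build_join_index(orders, "CustomerID")
print(join_idx[101])  # [5001, 5003]
```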
Efficient Processing OLAP Queries
The purpose of materializing cuboids and constructing OLAP index
structures is to speed up query processing in data cubes
Determine which operations should be performed on the
available cuboids
Transform drill, roll, etc. into corresponding SQL and/or OLAP
operations, e.g., dice = selection + projection
Determine which materialized cuboid(s) should be selected for
OLAP op.
sales cube [time, item, location] : sum(sales in dollars)
The cube tracks sales amounts (measure)
Along three dimensions:
Time (with hierarchy: day < month < quarter < year)
Item (with hierarchy: item name < brand < type)
Location (with hierarchy: street < city < province/state < country)
90
The user wants to know:
Total sales by brand and province or state, where year = 2004
Filters on: year = 2004
Groups by: brand and province or state
Available materialized cuboids (precomputed): 4 cuboids are stored in advance.
Which cuboid(s) can be used? We need one that can satisfy grouping by {brand, province/state} and the selection year = 2004.
Let the query to be processed be on {brand, province_or_state} with the condition “year = 2004”, and there are 4 materialized cuboids available:
1) {year, item_name, city}
2) {year, brand, country}
3) {year, brand, province_or_state}
4) {item_name, province_or_state} where year = 2004
Which should be selected to process the query?
Explore indexing structures and compressed vs. dense array structs in MOLAP
91
Cuboid #  Dimensions in the Cuboid
1         {year, item_name, city}
2         {year, brand, country}
3         {year, brand, province_or_state}
4         {item_name, province_or_state} with filter year = 2004
Find total sales by brand and province or state, where year = 2004.
C1: item_name is more detailed than brand (can roll up). Usable, but not optimal.
C2: country is more general (higher level in the hierarchy); you cannot drill down from country to province/state. Not usable.
C3: Exact match for what we need: group by brand and province/state; the filter year = 2004 is available. Best match, likely smallest size: ideal choice.
C4: item_name is finer than brand (can roll up); province/state is available; year = 2004 is already selected (filter applied). Might be efficient if well indexed. Medium choice.
In real systems, we use cost-based estimation to choose the cuboid with:
Lowest processing time
Smallest I/O
Efficient indexing
92
Why Can’t You Drill Down from country? (C2)
Because of data loss during aggregation.
Suppose the cuboid stores:

Year  Brand    Country  Sum(Sales)
2004  Samsung  USA      $500,000
2004  Samsung  Canada   $200,000

The province-level breakdown can no longer be recovered:

Year  Brand    Province    Sum(Sales)
2004  Samsung  California  ???
2004  Samsung  Texas       ???
93
Additional Strategies for Fast
Query Responses
MOLAP- achieves fast OLAP query responses by using optimized array
structures, compression for sparsity, and intelligent indexing
Online Aggregation (Real-time approximate results)
System shows partial/estimated results while the full query is still
processing.
These results are:
Refined over time as more data is processed.
Accompanied by confidence intervals (estimates of accuracy).
Benefit: Users can get early feedback and adjust their queries
without waiting.
Top-N Queries
Instead of getting all records sorted by some metric (like sales), you
ask only for the Top N results.
E.g., “Show top 10 selling products.”
This reduces processing time and improves interactivity.
Ideal for dashboards or quick decision-making.
94
OLAP Server Architectures
Relational OLAP (ROLAP)
Use relational or extended-relational DBMS to store and
manage warehouse data and OLAP middle ware
Include optimization of DBMS backend, implementation of
aggregation navigation logic, and additional tools and services
Greater scalability
Multidimensional OLAP (MOLAP)
Sparse array-based multidimensional storage engine
Fast indexing to pre-computed summarized data
Hybrid OLAP (HOLAP) (e.g., Microsoft SQLServer)
Flexibility, e.g., low level: relational, high-level: array
Specialized SQL servers (e.g., Redbricks)
Specialized support for SQL queries over star/snowflake
schemas
95
Chapter 4: Data Warehousing and On-line
Analytical Processing
Data Warehouse: Basic Concepts
Data Warehouse Modeling: Data Cube and
OLAP
Data Warehouse Design and Usage
Data cube Technology
Data Generalization by Attribute-Oriented
Induction
Summary
96
Data Generalization and Data Cube Computation
Data generalization is the process of
summarizing and abstracting a large
amount of task-relevant data in a database,
moving from a detailed (low-level) to a
more general (high-level) form.
97
Data Cube: A Lattice of Cuboids
0-D (apex) cuboid: all
1-D cuboids: time; item; location; supplier
2-D cuboids: time,item; time,location; time,supplier; item,location; item,supplier; location,supplier
3-D cuboids: time,item,location; time,item,supplier; time,location,supplier; item,location,supplier
4-D (base) cuboid: time, item, location, supplier
98
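The lattice above (2ⁿ cuboids for n dimensions with no hierarchies) can be enumerated directly; `cuboid_lattice` is an illustrative helper, not from the text:

```python
from itertools import combinations

def cuboid_lattice(dims):
    """All 2**n cuboids of an n-dimensional cube, grouped by arity:
    the empty combination is the apex ('all'), the full one is the base."""
    return {k: list(combinations(dims, k)) for k in range(len(dims) + 1)}

lattice = cuboid_lattice(["time", "item", "location", "supplier"])
print(len(lattice[0]))   # 1  -> the apex cuboid
print(len(lattice[2]))   # 6  -> the 2-D cuboids
print(sum(len(v) for v in lattice.values()))  # 16 = 2**4
```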
Data Cube: A Lattice of Cuboids
(The same lattice, from the apex cuboid “all” down to the 4-D base cuboid time, item, location, supplier.)
Base vs. aggregate cells; ancestor vs. descendant cells; parent vs. child cells:
1. (9/15, milk, Urbana, Dairy_land)
2. (9/15, milk, Urbana, *)
3. (*, milk, Urbana, *)
4. (*, milk, Chicago, *)
5. (*, milk, *, *)
99
Iceberg Cube, Closed Cube & Cube Shell
Is iceberg cube good enough?
2 base cells: {(a1, a2, a3 . . . , a100):10, (a1, a2, b3, . . . , b100):10}
How many cells will the iceberg cube have if having count(*)
>= 10? Hint: A huge but tricky number!
Closed cube:
Closed cell c: if there exists no cell d, s.t. d is a descendant of
c, and d has the same measure value as c.
Closed cube: a cube consisting of only closed cells
What is the closed cube of the above base cuboid? Hint: only
3 cells
Cube Shell
Precompute only the cuboids involving a small # of dimensions, e.g., 3
For (A1, A2, ..., A10), how many combinations to compute?
More dimension combinations will need to be computed on the fly
100
Efficient Methods for Data Cube
Computation
A Road Map for the Materialization of Different
Kinds of Cubes
Multiway Array Aggregation for Full Cube
Computation
BUC: Computing Iceberg Cubes from the Apex
Cuboid Downward
Star-Cubing: Computing Iceberg Cubes Using a
Dynamic Star-Tree Structure
Precomputing Shell Fragments for Fast High-
Dimensional OLAP
Computing Cubes with Complex Iceberg
Conditions
101
A Road Map for the Materialization
of Different Kinds of Cubes
Iceberg Cubes
Store only those cells above a threshold
(e.g., sales ≥ $100). Saves space by
ignoring low-value or empty data. Can be
defined in SQL using a HAVING clause.
SELECT month, city, customer_group,
COUNT(*)
FROM salesInfo
CUBE BY month, city, customer_group
HAVING COUNT(*) >= 10;
102
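The CUBE BY ... HAVING query above can be sketched as a brute-force computation: count every cell of every cuboid, then keep only the cells meeting the minimum support. This is fine for tiny inputs; real systems use methods such as BUC or Star-Cubing (`iceberg_cube` is an assumed helper name):

```python
from itertools import combinations
from collections import Counter

def iceberg_cube(rows, dims, min_sup):
    """Count every cell of every cuboid; keep cells with count >= min_sup.
    Aggregated-away dimensions are shown as '*'."""
    cells = Counter()
    for row in rows:
        for k in range(len(dims) + 1):
            for kept in combinations(range(len(dims)), k):
                cell = tuple(row[dims[i]] if i in kept else "*"
                             for i in range(len(dims)))
                cells[cell] += 1
    return {c: n for c, n in cells.items() if n >= min_sup}

rows = [{"month": "Jan", "city": "Salem"},
        {"month": "Jan", "city": "Salem"},
        {"month": "Feb", "city": "Erode"}]
cube = iceberg_cube(rows, ["month", "city"], min_sup=2)
print(cube)  # 4 cells survive, including ('*', '*'): 3
```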
Data cube computation is an essential task in DWH. Pre-
computation of all or part of a data cube can greatly reduce
the response time and increase performance of OLAP. This is
a challenging task.
How can we compute data cubes in advance?
Cube Materialization:
(i) Base cell: a cell in the base cuboid.
(ii) Aggregate cell: a cell in a non-base cuboid; each aggregated dimension is indicated by '*'.
Ex.: In a 3-D cube, all 1-D and 2-D cells are aggregate cells, and the 3-D cells are base cells.
(iii) Ancestor & descendant cells: the 1-D and 2-D cells are ancestors of a 3-D cell; the 3-D cell is their descendant.
103
Full cube:- Computation all the cells of all the cuboids
in data cube.
Iceberg cube:- Computing only the cuboid cells whose
measure satisfies the iceberg condition. EX:- Only a
small portion of cells may be "above the water" in a
sparse cube
compute cube sales iceberg as
select month, city, customer_group, count(*)
from salesInfo
cube by month, city, customer_group having
count(*) >= min support
Closed cube: the cube consisting of only closed cells. A cell c is closed if it has no descendant cell with the same measure value as c.
104
General Strategies for Data
Cube Computation
1. Sorting, Hashing, and Grouping
Purpose: Cluster and organize data to
facilitate aggregation.
Sort or hash dimension attributes to group
tuples (rows) with the same values.
This helps aggregate values efficiently
(e.g., total sales by branch/day/item).
Shared-sorts and shared-partitions reduce
redundant sorting and partitioning.
105
2. Simultaneous Aggregation and
Caching Intermediate Results
Purpose: Avoid redundant reads and
recalculations.
Use lower-level aggregates to compute
higher-level ones.
Cache intermediate results to reduce disk
I/O.
Techniques like amortized scans compute
multiple cuboids in a single pass.
106
3. Aggregation from the Smallest Child
Purpose: Use the least costly child
cuboid for parent computation.
If multiple child cuboids exist, choose the
one with fewer distinct values.
Example: Computing sales by branch from
(branch, year) is faster than from (branch,
item) if there are fewer years than items.
107
4. Apriori Pruning (for Iceberg Cubes)
Purpose: Eliminate unimportant or low-
support cells early.
Apriori property: If a cell doesn’t meet
the threshold (e.g., count < 10), its
descendants won’t either.
Saves time and space by avoiding
unnecessary computation.
Works well with antimonotonic
measures (like count or sum).
108
Iceberg cubes materialize only significant
cells, where values meet a minimum
support threshold.
Apriori pruning reduces exploration of
irrelevant aggregate cells.
110
Computing Full/Iceberg Cubes: 3 Methodologies

Methodology  Type       Strategy                Best For                    Key Technique
Multi-Way    Bottom-Up  Full scan, partitioned  Full cubes                  Array-based aggregation
BUC          Top-Down   Recursive pruning       Iceberg cubes, skewed data  Apriori-style pruning
Star-Cubing  Hybrid     Bottom-up + top-down    Iceberg cubes, large data   Star-tree, shared dims
111
Method                 Description
MultiWay               Full cube computation using arrays
BUC                    Iceberg cubes from the top (apex) downward
Star-Cubing            Combines top-down and bottom-up
Shell Fragments        Efficient computation for high-dimensional cubes
Complex Iceberg Cubes  Deals with complex measures like averages
112
Data Cube Technology
Data Cube Computation Methods
Multi-Way Array Aggregation
BUC
Star-Cubing
High-Dimensional OLAP
113
Multi-Way Array Aggregation
Array-based “bottom-up” algorithm
Using multi-dimensional chunks
No direct tuple comparisons
Simultaneous aggregation on multiple dimensions
Intermediate aggregate values are re-used for computing ancestor cuboids
Cannot do Apriori pruning: no iceberg optimization
(Lattice: All; A, B, C; AB, AC, BC; ABC)
114
Multi-way Array Aggregation for Cube
Computation (MOLAP)
Partition arrays into chunks (a small subcube which fits in memory).
Compressed sparse array addressing: (chunk_id, offset)
Compute aggregates in “multiway” by visiting cube cells in the order
which minimizes the # of times to visit each cell, and reduces
memory access and storage cost.
(Figure: a 3-D array on dimensions A (a0–a3), B (b0–b3), C (c0–c3), partitioned into 4 × 4 × 4 = 64 chunks, numbered 1–64.)
115
What is the best traversing order to do multi-way aggregation?
Suppose the cardinalities of A, B, C are 40, 400, 4000 respectively.
Thus the chunk edge size in each dimension is 10, 100, 1000 respectively.
Computing all cuboids means computing ABC, AB, AC, BC, A, B, C, and the apex.
Possible orderings:
(i) Ordering 1, 2, 3, ..., 64. By scanning chunks 1 to 4 of ABC, one chunk of the BC plane is computed (aggregated over a0 to a3). Multiway aggregation is used to avoid rescanning chunks. To avoid bringing any 3-D chunk into memory more than once, the minimum memory requirement for holding all relevant 2-D planes is:
whole AB plane + one row of the AC plane + one chunk of the BC plane,
i.e., 40×400 + 40×1000 + 100×1000 = 156,000 memory units
116
(ii) Ordering 1, 17, 33, 49, 5, 21, 37, 53, ...:
whole BC plane + one row of the AC plane + one chunk of the AB plane,
i.e., 400×4000 + 40×1000 + 10×100 = 1,641,000 memory units.
Therefore the most efficient order is the 1, 2, ..., 64 chunk order.
117
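The two memory figures can be reproduced with a small calculation; `memory_units` is an illustrative helper, and the 4 chunks per dimension follow the slide's example:

```python
def memory_units(card, chunks=4):
    """Minimum memory to hold the relevant 2-D planes for the two chunk
    scan orders on the slide, given dimension cardinalities (A, B, C)."""
    a, b, c = card
    pa, pb, pc = a // chunks, b // chunks, c // chunks  # chunk edge sizes
    # Order 1, 2, ..., 64: whole AB plane + one row of AC + one chunk of BC.
    order_abc = a * b + a * pc + pb * pc
    # Order 1, 17, 33, ...: whole BC plane + one row of AC + one chunk of AB.
    order_cba = b * c + a * pc + pa * pb
    return order_abc, order_cba

print(memory_units((40, 400, 4000)))  # (156000, 1641000)
```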
Multi-way Array Aggregation for Cube Computation (3-D to 2-D)
(Lattice: All; A, B, C; AB, AC, BC; ABC)
The best order is the one that minimizes the memory requirement and reduces I/Os.
118
Multi-way Array Aggregation for Cube Computation (2-D to 1-D)
(Lattice: All; A, B, C; AB, AC, BC; ABC)
119
Bottom-Up Computation (BUC)
BUC is an algorithm (“bottom-up construction”) for the computation of sparse and iceberg cubes.
Unlike Multiway, BUC constructs the cube from the apex cuboid toward the base cuboid, which allows BUC to share data partitioning costs.
This processing order also allows BUC to prune during construction, using the Apriori property.
The best order is the one that minimizes the memory requirement and reduces I/Os.
No simultaneous aggregation.
(Figure: the 4-D lattice from “all” down to ABCD; BUC’s processing order is 1: all, 2: A, 3: AB, 4: ABC, 5: ABCD, 6: ABD, 7: AC, 8: ACD, 9: AD, 10: B, 11: BC, 12: BCD, 13: BD, 14: C, 15: CD, 16: D.)
120
(Figure: the BUC algorithm pseudocode, whose line numbers are referenced below.)
121
Algorithm Explanation
Initially, the algorithm is called with the input relation (set of tuples).
BUC aggregates the entire input (line 1) and writes the resulting total (line 3). (Line 2 is an optimization feature that is discussed later in our example.)
For each dimension d (line 4), the input is partitioned on d (line 6). On return from Partition(), dataCount contains the total number of tuples for each distinct value of dimension d. Each distinct value of d forms its own partition.
Line 8 iterates through each partition. Line 10 tests the partition for minimum support: if the number of tuples in the partition satisfies (i.e., is at least) the minimum support, then the partition becomes the input relation for a recursive call to BUC, which computes the iceberg cube on the partitions for dimensions d + 1 to numDims (line 12).
122
Note that for a full cube (i.e., where minimum
support in the having clause is 1), the
minimum support condition is always satisfied.
Thus, the recursive call descends one level deeper
into the lattice. On return from the recursive call,
we continue with the next partition for d. After all
the partitions have been processed, the entire
process is repeated for each of the remaining
dimensions.
123
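The control flow described above (aggregate, partition on each remaining dimension, test minimum support, recurse) can be sketched in plain Python. This is an illustrative count-only version, not the book's exact pseudocode:

```python
def buc(rows, dims, min_sup, prefix=None, out=None):
    """Recursive BUC sketch: write the aggregate for the current group-by,
    then partition on each remaining dimension and recurse only into
    partitions meeting min_sup (Apriori pruning)."""
    prefix = prefix or {}
    out = {} if out is None else out
    out[tuple(sorted(prefix.items()))] = len(rows)  # write the aggregate
    for d_i, d in enumerate(dims):
        parts = {}
        for row in rows:                            # partition on d
            parts.setdefault(row[d], []).append(row)
        for val, part in parts.items():
            if len(part) >= min_sup:                # iceberg condition
                buc(part, dims[d_i + 1:], min_sup, {**prefix, d: val}, out)
    return out

rows = [{"A": "a1", "B": "b1"}, {"A": "a1", "B": "b1"}, {"A": "a2", "B": "b2"}]
cube = buc(rows, ["A", "B"], min_sup=2)
print(cube)  # cells for (), (A=a1), (A=a1,B=b1), (B=b1); a2/b2 pruned
```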
Star-Cubing: An Integrating Method
Computing iceberg cubes by top-down + bottom-up integration.
Explore shared dimensions
E.g., dimension A is the shared dimension of ACD and AD
ABD/AB means cuboid ABD has shared dimensions AB
Allows for shared computations
e.g., cuboid AB is computed simultaneously as ABD
Aggregate in a top-down manner, but with the bottom-up sub-layer underneath, which will allow Apriori pruning
Shared dimensions grow in bottom-up fashion
(Figure: the cuboid tree with shared dimensions: C/C, D; AC/AC, AD/A, BC/BC, BD/B, CD; ABC/ABC, ABD/AB, ACD/A, BCD; ABCD/all.)
124
Star-Cubing prunes (skips) unnecessary
computations during iceberg cube computation by
recognizing shared dimensions in a subtree of
cuboids.
For example, cuboid AB extending from ABD in Figure 4.8 would actually be pruned because AB was already computed in ABD/AB. Similarly, cuboid A extending from AD would also be pruned because it was already computed in ACD/A.
125
To see how the Star-Cubing algorithm works, we need a few more concepts, namely cuboid trees, star-nodes, and star-trees.
Each level in the tree represents a dimension, and each node represents an attribute value. Each node has four fields: the attribute value, aggregate value, pointer(s) to possible descendant(s), and pointer to possible sibling.
(Figure: a fragment of the base cuboid tree.)
126
A node p is a star-node if the single-dimensional aggregate on p does not satisfy the iceberg condition; otherwise, p is a non-star-node.
A cuboid tree that is compressed using star-nodes is called a star-tree.
127
Base Cuboid Table (Table 4.1):
Contains 5 tuples and 4 dimensions: A, B, C, D
Cardinalities: A → 2 values; B, C, D → 4 values each
Iceberg condition: minimum support (minsup) = 2

A   B   C   D   count
a1  b1  c1  d1  1
a1  b1  c4  d3  1
a1  b2  c2  d2  1
a2  b3  c3  d4  1
a2  b4  c3  d4  1
128
Chapter 4: Data Warehousing and On-line
Analytical Processing
Data Warehouse: Basic Concepts
Data Warehouse Modeling: Data Cube and
OLAP
Data Warehouse Design and Usage
Data cube Technology
Data Generalization by Attribute-Oriented
Induction
Summary
132
Attribute-Oriented Induction
Proposed in 1989 (KDD ‘89 workshop)
Not confined to categorical data nor particular
measures
How it is done?
Collect the task-relevant data (initial relation) using
a relational database query
Perform generalization by attribute removal or
attribute generalization
Apply aggregation by merging identical, generalized
tuples and accumulating their respective counts
Interaction with users for knowledge presentation
133
Attribute-Oriented Induction: An
Example
Example: Describe general characteristics of graduate
students in the University database
Step 1. Fetch relevant set of data using an SQL
statement, e.g.,
Select * (i.e., name, gender, major, birth_place,
birth_date, residence, phone#, gpa)
from student
where student_status in {“Msc”, “MBA”, “PhD” }
Step 2. Perform attribute-oriented induction
Step 3. Present results in generalized relation, cross-tab,
or rule forms
134
Class Characterization: An Example

Initial relation:
Name            Gender  Major    Birth-Place            Birth_date  Residence                 Phone #   GPA
Jim Woodman     M       CS       Vancouver, BC, Canada  8-12-76     3511 Main St., Richmond   687-4598  3.67
Scott Lachance  M       CS       Montreal, Que, Canada  28-7-75     345 1st Ave., Richmond    253-9106  3.70
Laura Lee       F       Physics  Seattle, WA, USA       25-8-70     125 Austin Ave., Burnaby  420-5232  3.83
...

Generalization plan: Name removed; Gender retained; Major → Sci/Eng/Bus; Birth-Place → Country; Birth_date → Age range; Residence → City; Phone # removed; GPA → Excl/VG/...

Prime generalized relation:
Gender  Major    Birth_region  Age_range  Residence  GPA        Count
M       Science  Canada        20-25      Richmond   Very-good  16
F       Science  Foreign       25-30      Burnaby    Excellent  22
...

Crosstab (Gender × Birth_Region):
        Canada  Foreign  Total
M       16      14       30
F       10      22       32
Total   26      36       62
135
Basic Principles of Attribute-Oriented
Induction
Data focusing: task-relevant data, including dimensions,
and the result is the initial relation
Attribute-removal: remove attribute A if there is a large
set of distinct values for A but (1) there is no
generalization operator on A, or (2) A’s higher level
concepts are expressed in terms of other attributes
Attribute-generalization: If there is a large set of distinct
values for A, and there exists a set of generalization
operators on A, then select an operator and generalize A
Attribute-threshold control: typical 2-8, specified/default
Generalized relation threshold control: control the final
relation/rule size
136
Attribute-Oriented Induction: Basic
Algorithm
InitialRel: Query processing of task-relevant data,
deriving the initial relation.
PreGen: Based on the analysis of the number of
distinct values in each attribute, determine
generalization plan for each attribute: removal? or
how high to generalize?
PrimeGen: Based on the PreGen plan, perform
generalization to the right level to derive a “prime
generalized relation”, accumulating the counts.
Presentation: User interaction: (1) adjust levels by
drilling, (2) pivoting, (3) mapping into rules, cross
tabs, visualization presentations.
137
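The algorithm's core steps (attribute removal, attribute generalization, merging identical generalized tuples with accumulated counts) can be sketched as follows. The concept-hierarchy dictionaries and the `aoi` helper are illustrative stand-ins, not the book's implementation:

```python
from collections import Counter

# Hypothetical one-level concept hierarchies for two attributes.
MAJOR = {"CS": "Science", "Physics": "Science", "MBA": "Business"}
PLACE = {"Vancouver": "Canada", "Montreal": "Canada", "Seattle": "Foreign"}

def aoi(rows):
    """AOI sketch: drop 'name' (no useful generalization operator), climb
    the hierarchies for the other attributes, then merge identical
    generalized tuples while accumulating their counts."""
    counts = Counter()
    for r in rows:
        g = (r["gender"], MAJOR[r["major"]], PLACE[r["birth_place"]])
        counts[g] += 1
    return counts

rows = [
    {"name": "Jim",   "gender": "M", "major": "CS",      "birth_place": "Vancouver"},
    {"name": "Scott", "gender": "M", "major": "CS",      "birth_place": "Montreal"},
    {"name": "Laura", "gender": "F", "major": "Physics", "birth_place": "Seattle"},
]
summary = aoi(rows)
print(summary)  # {('M', 'Science', 'Canada'): 2, ('F', 'Science', 'Foreign'): 1}
```

Jim and Scott become identical after generalization, so they merge into one tuple with count 2, exactly the compression the prime generalized relation relies on.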
Presentation of Generalized
Results
Generalized relation:
Relations where some or all attributes are generalized, with
counts or other aggregation values accumulated.
Cross tabulation:
Mapping results into cross tabulation form (similar to
contingency tables).
Visualization techniques:
Pie charts, bar charts, curves, cubes, and other visual
forms.
Quantitative characteristic rules:
Mapping the generalized result into characteristic rules with quantitative information associated with them, e.g.,
grad(x) ∧ male(x) ⇒ birth_region(x) = “Canada” [t: 53%] ∨ birth_region(x) = “foreign” [t: 47%].
138
Mining Class Comparisons
Comparison: Comparing two or more classes
Method:
Partition the set of relevant data into the target class and
the contrasting class(es)
Generalize both classes to the same high level concepts
Compare tuples with the same high level descriptions
Present for every tuple its description and two measures
support - distribution within single class
comparison - distribution between classes
Highlight the tuples with strong discriminant features
Relevance Analysis:
Find attributes (features) which best distinguish different
classes
139
Chapter 4: Data Warehousing and On-line
Analytical Processing
Data Warehouse: Basic Concepts
Data Warehouse Modeling: Data Cube and
OLAP
Data Warehouse Design and Usage
Data Warehouse Implementation
Data Generalization by Attribute-Oriented
Induction
Summary
140
Summary
Data warehousing: A multi-dimensional model of a data warehouse
A data cube consists of dimensions & measures
Star schema, snowflake schema, fact constellations
OLAP operations: drilling, rolling, slicing, dicing and pivoting
Data Warehouse Architecture, Design, and Usage
Multi-tiered architecture
Business analysis design framework
Information processing, analytical processing, data mining,
OLAM (Online Analytical Mining)
Implementation: Efficient computation of data cubes
Partial vs. full vs. no materialization
Indexing OLAP data: Bitmap index and join index
OLAP query processing
OLAP servers: ROLAP, MOLAP, HOLAP
Data generalization: Attribute-oriented induction
141