
SONA COLLEGE OF TECHNOLOGY
(Autonomous), Salem - 5

U23IT953 - Data Warehousing and Data Mining

Prepared by,
Kruthika P
Assistant Professor
Department of Information Technology
Sona College of Technology

1
Data Warehousing
Unit – 1
Introduction to data warehousing: A Multi-dimensional Model, Types of Schema for the Multi-dimensional Model, Architecture and Implementation, From Data Warehouse to Data Mining.
Data cube technology: Methods for data cube computation, further development of data cube and OLAP technology, attribute-oriented induction.

2
Chapter 4: Data Warehousing and On-line
Analytical Processing

 Data Warehouse: Basic Concepts


 Data Warehouse Modeling: Data Cube and
OLAP
 Data Warehouse Design and Usage
 Data Warehouse Implementation
 Data Generalization by Attribute-Oriented
Induction
 Summary
3
What is a Data Warehouse?
 Defined in many different ways, but not rigorously.
 A decision support database that is maintained
separately from the organization’s operational database
 Supports information processing by providing a solid platform of consolidated, historical data for analysis.
 “A data warehouse is a subject-oriented, integrated, time-
variant, and nonvolatile collection of data in support of
management’s decision-making process.”—W. H. Inmon
 Data warehousing:
 The process of constructing and using data warehouses

4
Data Warehouse—Subject-Oriented

 Organized around major subjects, such as customer, product, sales
 Focusing on the modeling and analysis of data for decision makers, not on daily operations or transaction processing
 Provide a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process

5
Data Warehouse—Integrated
 Constructed by integrating multiple,
heterogeneous data sources
 relational databases, flat files, on-line

transaction records
 Data cleaning and data integration techniques
are applied.
 Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources
 E.g., Hotel price: currency, tax, breakfast covered, etc.
 When data is moved to the warehouse, it is converted.
6
Data Warehouse—Time Variant

 The time horizon for the data warehouse is significantly longer than that of operational systems
 Operational database: current value data
 Data warehouse data: provide information from a
historical perspective (e.g., past 5-10 years)
 Every key structure in the data warehouse
 Contains an element of time, explicitly or
implicitly
 But the key of operational data may or may not
contain “time element”

7
Data Warehouse—Nonvolatile
 A physically separate store of data transformed
from the operational environment
 Operational update of data does not occur in the
data warehouse environment
 Does not require transaction processing,
recovery, and concurrency control mechanisms
 Requires only two operations in data
accessing:

initial loading of data and access of data

8
Why a Separate Data Warehouse?
 High performance for both systems
 DBMS— tuned for OLTP: access methods, indexing,
concurrency control, recovery
 Warehouse—tuned for OLAP: complex OLAP queries,
multidimensional view, consolidation
 Different functions and different data:
 missing data: Decision support requires historical data which
operational DBs do not typically maintain
 data consolidation: DS requires consolidation (aggregation,
summarization) of data from heterogeneous sources
 data quality: different sources typically use inconsistent data
representations, codes and formats which have to be
reconciled
 Note: There are more and more systems which perform OLAP
analysis directly on relational databases
9
Difference between operational DB and data warehouse

[Comparison table of operational databases vs. data warehouses not captured in this export.]

10
Data Warehouse: A Multi-Tiered Architecture

 Data Sources: operational DBs and other external sources
 Data Storage: back-end tools extract, transform, load, and refresh data into the warehouse and data marts; a metadata repository, monitor, and integrator manage this tier
 OLAP Engine: an OLAP server provides multidimensional views of the stored data
 Front-End Tools: analysis, query/reporting, and data mining tools

11
Data Warehouse: A Three-Tiered Architecture

[Figure not captured in this export.]

12
Data Warehouse Models
 Enterprise warehouse
 collects all of the information about subjects spanning

the entire organization


 Data Mart
 a subset of corporate-wide data that is of value to a specific group of users. Its scope is confined to specific, selected groups, such as a marketing data mart
 Independent vs. dependent (directly from warehouse) data marts
 Virtual warehouse
 A set of views over operational databases

 Only some of the possible summary views may be

materialized
13
A recommended approach for data warehouse development

[Figure not captured in this export.]

14
Back end Tools
Extraction, Transformation, and Loading (ETL)
Data extraction

 get data from multiple, heterogeneous, and external

sources
Data cleaning

 detect errors in the data and rectify them when possible

Data transformation

 convert data from legacy or host format to warehouse

format
Load

 sort, summarize, consolidate, compute views, check

integrity, and build indices and partitions


Refresh

 propagate the updates from the data sources to the

warehouse
15
Metadata Repository
 Meta data is the data defining warehouse objects. It stores:
 Description of the structure of the data warehouse

schema, view, dimensions, hierarchies, derived data definitions, data mart locations and contents
 Operational meta-data

data lineage (history of migrated data and transformation path),
currency of data (active, archived, or purged), monitoring
information (warehouse usage statistics, error reports, audit
trails)
 The algorithms used for summarization
 The mapping from operational environment to the data warehouse
 Data related to system performance

warehouse schema, view and derived data definitions
 Business data

business terms and definitions, ownership of data, charging
policies
16
Chapter 4: Data Warehousing and On-line
Analytical Processing

 Data Warehouse: Basic Concepts


 Data Warehouse Modeling: Data Cube and
OLAP
 Data Warehouse Design and Usage
 Data Warehouse Implementation
 Data Generalization by Attribute-Oriented
Induction
 Summary
17
Multi dimensional model

 A data warehouse and OLAP tools are based on a multidimensional data model, which views data in the form of a data cube.
 A data cube allows data to be modeled and viewed in multiple dimensions.
 It is defined by dimensions and facts.
 Dimensions: the perspectives or entities with respect to which an organization wants to keep records.
 Dimension table: each dimension may have a table associated with it, which further describes the dimension.
 Facts: numeric measures.
 Fact table: a multidimensional data model is typically organized around a central theme, such as sales, which is represented by a fact table.
 The fact table contains the names of the facts, or measures, as well as keys to each of the related dimension tables.
18
Examples of Facts:
 Sales amount
 Quantity sold
 Profit
 Revenue
 Order count
Characteristics:
 Numeric/measurable
 Typically aggregatable (SUM, AVG, COUNT, etc.)
 Linked to dimensions by foreign keys
 Can be additive, semi-additive, or non-additive:
 Additive: Can be summed across all dimensions (e.g., sales
amount)
 Semi-additive: Can be summed across some dimensions (e.g.,
account balances over time)
 Non-additive: Cannot be summed (e.g., percentages, ratios)

19
Dimensions Table
 Dimensions provide the context for the facts — they are
descriptive attributes that help answer the "who, what, when,
where, why" of a business process.
 Examples of Dimensions:
 Customer (name, location, age group)
 Product (name, category, brand)
 Time (date, month, year)
 Location (city, region, country)
 Salesperson (name, department)
 Dimension tables can be specified by users or experts, or
automatically generated and adjusted based on data distributions.

20
 Characteristics:
 Contain textual or categorical data
 Used for filtering, grouping, and
labeling facts
 Typically stored in dimension tables

21
Example for fact and dimension tables

[Figure not captured in this export.]

22
Summary of Facts and Dimensions

Term             Description
Dimension        A perspective or category used to analyze data. Example: Time, Item, Location, Supplier.
Fact             A measurable value or metric. Example: Sales Amount, Units Sold.
Fact Table       Central table containing measurable facts (measures) plus foreign keys to dimension tables.
Dimension Table  Provides context (attributes) for each dimension.

23
Data warehouse for AllElectronics

[Figure not captured in this export.]

24
2D Representation

In this 2-D representation, the sales for Vancouver are shown with respect to the time dimension (organized in quarters) and the item dimension (organized according to the types of items sold). The fact or measure displayed is dollars_sold (in thousands).

[Figure not captured in this export.]

25
3D Representation

[Figure not captured in this export.]

26
4D Representation

[Figure not captured in this export.]

27
Lattice of Cuboids

Data Cube as Lattice of Cuboids


 Each subset of dimensions gives a new view of the data:
 0-D (Apex Cuboid): Total sales (no dimension breakdown)
 1-D Cuboids: e.g., total sales by time
 2-D Cuboids: e.g., sales by item and time
 3-D Cuboids: e.g., sales by item, time, location
 4-D Cuboid (Base): Most detailed data (time + item +
location + supplier)

28
Base Cuboid (Most Detailed
Data)
 The base cuboid contains the lowest level of summarization.
 It shows the raw, detailed data — for every possible
combination of all dimensions.
 Example (4 Dimensions):
 Time = Q1, Q2, Q3, Q4
 Item = Computer, Phone, etc.
 Location = Chicago, New York, etc.
 Supplier = SUP1, SUP2, SUP3
 Base Cuboid will include sales for:
 Q1, Computer, Chicago, SUP1
 Q1, Phone, New York, SUP3
 Q2, Entertainment, Vancouver, SUP2
…and so on — every possible combination!
Think of it like the bottom row of your cube — the most granular,
row-by-row data.

29
Apex Cuboid (Most Summarized Data)

 The apex cuboid contains the highest level of summarization.
 It is the total or grand summary of the data, aggregated across all dimensions.
 We summarize over all times, items, locations, and suppliers.
 Apex Cuboid just shows: Total Sales = ₹50,000 (e.g.)
 Think of it like the top of a pyramid — just one number, fully summarized.

30
 In data warehousing literature, an n-D base
cube is called a base cuboid.
 The top most 0-D cuboid, which holds the
highest-level of summarization, is called
the apex cuboid.
 The lattice of cuboids forms a data cube.

31
Cube: A Lattice of Cuboids

0-D (apex) cuboid: all
1-D cuboids: time, item, location, supplier
2-D cuboids: time,item; time,location; time,supplier; item,location; item,supplier; location,supplier
3-D cuboids: time,item,location; time,item,supplier; time,location,supplier; item,location,supplier
4-D (base) cuboid: time, item, location, supplier

32
Dense cube vs Sparse cube

Sparse Cube:
 Definition: A sparse cube is one where most combinations of dimensions do not have associated data (i.e., they're empty).
Dense Cube:
 Definition: A dense cube is one where almost every possible combination of dimension values has a corresponding data value.

33
Dense cube vs Sparse cube
If you have:
|A| = 100 products
|B| = 50 stores
|C| = 12 months
Total possible combinations (in a dense cube):

100 × 50 × 12 = 60,000
But maybe only 5,000 combinations actually exist (because many products aren't sold in some stores or months).
So:
 Dense cube would store 60,000 cells.
 Sparse cube would store only 5,000 cells.

34
Problems

Suppose a base cuboid has three dimensions: A, B, and C, with the following number of cells:
|A| = 80
|B| = 15
|C| = 25
Each cube cell stores one measure of 8 bytes.
What is the total size of the computed cube assuming the cube is dense?

35
Step 1
Number of cuboids
Since there are 3 dimensions, the total number of cuboids is

2^3 = 8.
Now calculate the number of cells in each cuboid:

(A, B, C) = 80 × 15 × 25 = 30,000
(A, B) = 80 × 15 = 1,200
(A, C) = 80 × 25 = 2,000
(B, C) = 15 × 25 = 375
(A) = 80
(B) = 15
(C) = 25
() = 1 (the apex cuboid)
Total number of cells = 30,000 + 1,200 + 2,000 + 375 + 80 + 15 + 25 + 1 = 33,696

36
Step 2

 Multiply by bytes per cell
 Each cell uses 8 bytes.
 Total size = 33,696 × 8 = 269,568 bytes

Let's Try
Suppose a base cuboid has three dimensions: A, B, and C, with the following number of cells:
|A| = 60
|B| = 12
|C| = 30
Each cube cell stores one measure of 2 bytes.
What is the total size of the computed cube assuming the cube is dense?

37
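To sanity-check this arithmetic, here is a small Python sketch (mine, not from the slides) that sums the cell counts of all 2^n cuboids; it reproduces the worked example above, and you can rerun it with the practice values (60, 12, 30, 2 bytes):

from itertools import combinations

def dense_cube_size(cardinalities, bytes_per_cell):
    # Every subset of dimensions forms one cuboid; a cuboid's cell count
    # is the product of the cardinalities of the dimensions it keeps.
    # The empty subset is the 1-cell apex cuboid.
    total_cells = 0
    for k in range(len(cardinalities) + 1):
        for combo in combinations(cardinalities, k):
            cells = 1
            for c in combo:
                cells *= c
            total_cells += cells
    return total_cells, total_cells * bytes_per_cell

print(dense_cube_size([80, 15, 25], 8))   # (33696, 269568), matching the worked example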
For Sparse Cube
 No need to calculate the total number of possible cells, because only non-empty cells are stored.
 So, if the number of non-empty cells is 10, the total size is 10 × 8 = 80 bytes.
 Assuming sparsity is around 1% to 10% of the total number of cells:

Sparsity (%)  Non-empty Cells  Estimated Size (bytes)
100% (Dense)  33,696           33,696 × 8 = 269,568
10%           3,370            3,370 × 8 = 26,960
5%            1,684            1,684 × 8 = 13,472
1%            337              337 × 8 = 2,696
Exact case    10               10 × 8 = 80

38
Schema for multi-dimensional
model
 The entity-relationship data model is appropriate
for online transaction processing.
 A data warehouse requires a concise, subject-
oriented schema that facilitates online data
analysis.
 A popular data model for a data warehouse is a
multidimensional model, which can exist in the
form of

a star schema,

a snowflake schema,

a fact constellation schema.

39
Star schema

 a large central table (fact table) containing the bulk of the data, with no redundancy.
 a set of smaller attendant tables (dimension tables), one for each dimension, which may contain redundancy.
 A fact table in the middle connected to a set of dimension tables.
40
Example of Star Schema

Sales Fact Table: time_key, item_key, branch_key, location_key, units_sold, dollars_sold, avg_sales (measures)
time dimension: time_key, day, day_of_the_week, month, quarter, year
item dimension: item_key, item_name, brand, type, supplier_type
branch dimension: branch_key, branch_name, branch_type
location dimension: location_key, street, city, state_or_province, country

41
Snowflake schema

 A refinement of star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to a snowflake.
 The snowflake structure can reduce the effectiveness of browsing, since more joins will be needed to execute a query.

42
Example of Snowflake Schema

Sales Fact Table: time_key, item_key, branch_key, location_key, units_sold, dollars_sold, avg_sales (measures)
time dimension: time_key, day, day_of_the_week, month, quarter, year
item dimension: item_key, item_name, brand, type, supplier_key
supplier dimension: supplier_key, supplier_type
branch dimension: branch_key, branch_name, branch_type
location dimension: location_key, street, city_key
city dimension: city_key, city, state_or_province, country

43
Fact constellations

 Multiple fact tables share dimension tables, viewed as a collection of stars, therefore called a galaxy schema or fact constellation.
 Sophisticated applications may require multiple fact tables to share dimension tables.

44
Example of Fact Constellation

Sales Fact Table: time_key, item_key, branch_key, location_key, units_sold, dollars_sold, avg_sales (measures)
Shipping Fact Table: time_key, item_key, shipper_key, from_location, to_location, dollars_cost, units_shipped
time dimension: time_key, day, day_of_the_week, month, quarter, year
item dimension: item_key, item_name, brand, type, supplier_type
branch dimension: branch_key, branch_name, branch_type
location dimension: location_key, street, city, province_or_state, country
shipper dimension: shipper_key, shipper_name, location_key, shipper_type

45
Examples for Defining Star, Snowflake, and Fact Constellation Schemas

 To define a multidimensional schema, use a SQL-based data mining query language like DMQL (Data Mining Query Language).
 DMQL provides two key language primitives:
 define cube — to define the data cube (the main structure that combines dimensions and measures).
 define dimension — to define each dimension (the structure of categorical data).

46
Schema for Data Warehouse and
Data Mart

 A data warehouse collects information about subjects that span the entire organization, such as customers, items, sales, assets, and personnel, and thus its scope is enterprise-wide. For data warehouses, the fact constellation schema is commonly used, since it can model multiple, interrelated subjects.
 A data mart, on the other hand, is a department subset of the data warehouse that focuses on selected subjects, and thus its scope is department-wide. For data marts, the star or snowflake schema is commonly used, since both are geared toward modeling single subjects.

51
A Concept Hierarchy: Dimension (location)

all:     all
region:  Europe ... North_America
country: Germany ... Spain; Canada ... Mexico
city:    Frankfurt ...; Vancouver ... Toronto
office:  L. Chan ... M. Wind

52
Concept Hierarchies and lattice structures

[Figure not captured in this export.]

53
Data Cube Measures

 A data cube measure is a numeric function that can be evaluated at each point in the data cube space.
 Data cube space can be defined by a set of dimension–value pairs; for example, (time = "Q1", location = "Vancouver", item = "computer").

54
Three Categories of Measures

 Distributive: if the result derived by applying the function to n aggregate values is the same as that derived by applying the function on all the data without partitioning

E.g., count(), sum(), min(), max()
 Algebraic: if it can be computed by an algebraic
function with M arguments (where M is a bounded
integer), each of which is obtained by applying a
distributive aggregate function

E.g., avg(), min_N(), standard_deviation()
 Holistic: if there is no constant bound on the storage
size needed to describe a subaggregate.

E.g., median(), mode(), rank()
55
Examples
 Dataset: [3, 7, 2, 9]
Split into two groups: [3, 7] and [2, 9]
 sum([3, 7]) = 10, sum([2, 9]) = 11, total sum = 10 + 11 = 21
 sum([3, 7, 2, 9]) = 21 → same result
 So, sum() is distributive.

 Group 1: [3, 7] → sum = 10, count = 2
 Group 2: [2, 9] → sum = 11, count = 2
 Total avg = (10 + 11) / (2 + 2) = 21 / 4 = 5.25
 So, avg() is algebraic – it uses sum and count, which are distributive.

56
Examples
 Dataset: [3, 7, 2, 9]
Split into: [3, 7] and [2, 9]
You can’t find the median of the full
dataset just by knowing the medians of
each group.
 Median of [3, 7] = 5
 Median of [2, 9] = 5.5
But overall median of [2, 3, 7, 9] = (3 +
7)/2 = 5 → doesn’t match
 ❌ So, median() is not distributive or
algebraic → it's holistic
57
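To make the three categories concrete, here is a minimal Python sketch (my illustration, using the dataset from these slides) showing that sum() combines exactly from partitions, avg() combines via the distributive pair (sum, count), and median() does not:

from statistics import median

groups = [[3, 7], [2, 9]]
full = [x for g in groups for x in g]

# Distributive: partial sums combine into the exact global sum.
assert sum(sum(g) for g in groups) == sum(full) == 21

# Algebraic: avg() is computable from two distributive values (sum, count).
sums, counts = [sum(g) for g in groups], [len(g) for g in groups]
assert sum(sums) / sum(counts) == sum(full) / len(full) == 5.25

# Holistic: medians of the parts do not determine the global median.
print([median(g) for g in groups], median(full))   # [5.0, 5.5] vs 5.0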
Typical OLAP Operations
 Roll up (drill-up): summarize data

by climbing up hierarchy or by dimension reduction
 Drill down (roll down): reverse of roll-up

from higher level summary to lower level summary or
detailed data, or introducing new dimensions

Slice and dice: project and select
 Pivot (rotate):

reorient the cube, visualization, 3D to series of 2D
planes
 Other operations

drill across: involving (across) more than one fact table

drill through: through the bottom level of the cube to
its back-end relational tables (using SQL)

58
Comparison Chart

Operation      What it does                           Example From Our Table
Roll-up        Aggregates up (e.g., City → Country)   Group sales by Country and Month
Drill-down     Gets more detailed                     Go from Country to City
Slice          Fix one dimension                      Filter for Product = iPhone
Dice           Filter by multiple dimensions          City = Chicago, Product IN (iPhone, Galaxy)
Pivot          Rotate cube layout                     Turn rows → columns for Product by City
Drill-across   Join multiple fact tables              Combine sales with inventory
Drill-through  Go to raw transactions                 View actual sales records for one summary cell

59
OLAP operations with Example

Date City Product Category Sales

2025-07-01 Chicago iPhone Mobile 1000

2025-07-01 Chicago Galaxy Mobile 900

2025-07-01 New York iPhone Mobile 1100

2025-07-01 New York LaptopX Laptop 1200

2025-07-02 Chicago iPhone Mobile 1000

2025-07-02 New York Galaxy Mobile 950

2025-07-02 Chicago LaptopX Laptop 1300

60
🔼 1. Roll-up
 Goal: Go from detailed → summarized (e.g.,
City → State, Date → Month)
 Assume:

 Chicago & New York → USA

 July 2025 is the only month shown

Query/Transformation:
 Group by Country and Month:

Country Month Sales


USA July 7450

61
🔽 2. Drill-down
 Goal: Go from summarized → detailed

Country Month Sales


USA July 7450

 Drill down to City and Date:


Date City Sales
2025-07-01 Chicago 1900
2025-07-01 New York 2300
2025-07-02 Chicago 2300
2025-07-02 New York 950

62
🔪 3. Slice
 Goal: Fix one dimension (select specific
value)
 Example: Slice on Product = 'iPhone'
 You removed all non-iPhone rows.
Date City Product Sales
2025-07-01 Chicago iPhone 1000
2025-07-01 New York iPhone 1100
2025-07-02 Chicago iPhone 1000

63
🎲 4. Dice
 Goal: Filter on multiple dimensions
 Example:
 Product IN ('iPhone', 'Galaxy')
 City IN ('Chicago')
 Date = 2025-07-01
 Result:

Date City Product Sales


2025-07-01 Chicago iPhone 1000
2025-07-01 Chicago Galaxy 900

64
🔄 5. Pivot (Rotate)
 Goal: Change data layout for visualization
 Before (standard table):

City Product Sales


Chicago iPhone 1000
Chicago Galaxy 900
New York iPhone 1100

 Rows: City
 Columns: Product
 Values: Sales

 After (pivoted):

          iPhone  Galaxy
Chicago   1000    900
New York  1100    -

65
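The first five operations can be mimicked on the sample table with pandas; this is an illustrative sketch (assuming pandas is available), not a real OLAP engine:

import pandas as pd

# The sample sales table from the slides.
sales = pd.DataFrame({
    "Date": ["2025-07-01"] * 4 + ["2025-07-02"] * 3,
    "City": ["Chicago", "Chicago", "New York", "New York", "Chicago", "New York", "Chicago"],
    "Product": ["iPhone", "Galaxy", "iPhone", "LaptopX", "iPhone", "Galaxy", "LaptopX"],
    "Sales": [1000, 900, 1100, 1200, 1000, 950, 1300],
})

# Roll-up: aggregate City -> Country (both cities roll up to USA).
sales["Country"] = "USA"
rollup = sales.groupby("Country")["Sales"].sum()          # USA: 7450

# Drill-down: back to City and Date detail.
drilldown = sales.groupby(["Date", "City"])["Sales"].sum()

# Slice: fix one dimension (Product = iPhone).
slice_ = sales[sales["Product"] == "iPhone"]

# Dice: filter on several dimensions at once.
dice = sales[(sales["City"] == "Chicago")
             & sales["Product"].isin(["iPhone", "Galaxy"])
             & (sales["Date"] == "2025-07-01")]

# Pivot: rotate rows -> columns (City by Product).
pivot = sales.pivot_table(index="City", columns="Product",
                          values="Sales", aggfunc="sum")
print(rollup, drilldown, slice_, dice, pivot, sep="\n\n")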
🔁 6. Drill-across
 Goal: Compare multiple fact tables side-
by-side
 Let’s say you have another table:

inventory_data
Date City Product Inventory
2025-07-01 Chicago iPhone 30
2025-07-01 Chicago Galaxy 25

 Drill-across: join sales_data and inventory_data:

Date       City    Product Sales Inventory
2025-07-01 Chicago iPhone  1000  30
2025-07-01 Chicago Galaxy  900   25

66
7. Drill-through
 Goal: From cube (aggregated summary) to
raw, transactional records
 Suppose your summary cube says:
City Product Sales
Chicago iPhone 2000
 You drill through to raw transaction
records:
TransactionID CustomerID Date       City    Product Sales
TX001         C123       2025-07-01 Chicago iPhone  1000
TX002         C456       2025-07-02 Chicago iPhone  1000

 This shows the real data entries that make up the cube cell.

67
68
Chapter 4: Data Warehousing and On-line
Analytical Processing

 Data Warehouse: Basic Concepts


 Data Warehouse Modeling: Data Cube and
OLAP
 Data warehouse to data mining
 Data Generalization by Attribute-Oriented
Induction
 Summary

69
Datawarehouse Applications
 Information processing supports querying, basic statistical analysis,
and reporting using crosstabs, tables, charts, or graphs. A current trend
in data warehouse information processing is to construct low-cost Web-
based accessing tools that are then integrated with Web browsers.
 Analytical processing supports basic OLAP operations, including slice-
and-dice, drill-down, roll-up, and pivoting. It generally operates on
historical data in both summarized and detailed forms. The major
strength of on-line analytical processing over information processing is
the multidimensional data analysis of data warehouse data.
 Data mining supports knowledge discovery by finding hidden patterns
and associations, constructing analytical models, performing
classification and prediction, and presenting the mining results using
visualization tools.
70
Information processing is not data mining

 Information processing, based on queries, can find useful information.
 However, answers to such queries reflect the information directly stored in databases or computable by aggregate functions.
 They do not reflect sophisticated patterns or regularities buried in the database.

71
OLAP and DM

 OLAP is a data summarization/aggregation tool that helps simplify data analysis.
 Data mining allows the automated discovery of implicit patterns and interesting knowledge hidden in large amounts of data.
 OLAP tools are targeted toward simplifying and supporting interactive data analysis.
 The goal of data mining tools is to automate as much of the process as possible, while still allowing users to guide the process.
 Data mining goes one step beyond traditional on-line analytical processing.

72
Data Mining
 Data mining covers both data description and data modeling. It performs not only data summary and comparison but also association, classification, prediction, clustering, time-series analysis, and other data analysis tasks.
 It may analyze data existing at more detailed granularities
than the summarized data provided in a data warehouse.
 It may also analyze transactional, spatial, textual, and
multimedia data that are difficult to model with current
multidimensional database technology.

73
On-Line Analytical Mining (OLAM)

 On-line analytical mining (OLAM) (also called OLAP mining) integrates on-line analytical processing (OLAP) with data mining, and mines knowledge in multidimensional databases.
Importance:
 High quality of data in data warehouses.
 Available information processing infrastructure surrounding data warehouses.
 OLAP-based exploratory data analysis.
 On-line selection of data mining functions.

74

[OLAM architecture figure not captured in this export.]

75
Design of Data Warehouse: A
Business Analysis Framework
 Four views regarding the design of a data
warehouse

Top-down view

allows selection of the relevant information necessary
for the data warehouse

Data source view

exposes the information being captured, stored, and
managed by operational systems

Data warehouse view

consists of fact tables and dimension tables

Business query view

sees the perspectives of data in the warehouse from
the view of end-user
76
Data Warehouse Design
Process
 Top-down, bottom-up approaches or a combination of both

Top-down: Starts with overall design and planning (mature)

Bottom-up: Starts with experiments and prototypes (rapid)
 From software engineering point of view

Waterfall: structured and systematic analysis at each step
before proceeding to the next

Spiral: rapid generation of increasingly functional systems, with short turnaround time
 Typical data warehouse design process

Choose a business process to model, e.g., orders, invoices, etc.

Choose the grain (atomic level of data) of the business process

Choose the dimensions that will apply to each fact table record

Choose the measure that will populate each fact table record

77
Data Warehouse Development: A Recommended Approach

[Figure: define a high-level corporate data model first; through successive model refinement, build data marts and an enterprise data warehouse in parallel, leading to a multi-tier data warehouse with distributed data marts.]

78
Data Warehouse Usage
 Three kinds of data warehouse applications

Information processing

supports querying, basic statistical analysis, and
reporting using crosstabs, tables, charts and graphs

Analytical processing

multidimensional analysis of data warehouse data

supports basic OLAP operations, slice-dice, drilling,
pivoting

Data mining

knowledge discovery from hidden patterns

supports associations, constructing analytical models,
performing classification and prediction, and presenting
the mining results using visualization tools
79
From On-Line Analytical Processing
(OLAP)
to On Line Analytical Mining (OLAM)
 Why online analytical mining?

High quality of data in data warehouses

DW contains integrated, consistent, cleaned
data

Available information processing structure
surrounding data warehouses

ODBC, OLEDB, Web accessing, service facilities,
reporting and OLAP tools

OLAP-based exploratory data analysis

Mining with drilling, dicing, pivoting, etc.

On-line selection of data mining functions

Integration and swapping of multiple mining
functions, algorithms, and tasks
80
Chapter 4: Data Warehousing and On-line
Analytical Processing

 Data Warehouse: Basic Concepts


 Data Warehouse Modeling: Data Cube and
OLAP
 Data Warehouse Design and Usage
 Data Warehouse Implementation
 Data Generalization by Attribute-Oriented
Induction
 Summary
81
The “Compute Cube” Operator

 Cube definition and computation in DMQL
define cube sales [item, city, year]: sum (sales_in_dollars)
compute cube sales
 Transform it into a SQL-like language (with a new operator cube by, introduced by Gray et al.’96)
SELECT item, city, year, SUM (amount)
FROM SALES
CUBE BY item, city, year
-- we also use group by
 Need to compute the following group-bys:
(date, product, customer), (date, product), (date, customer), (product, customer), (date), (product), (customer), ()

[Lattice figure over (city, item, year): () at the apex; (city), (item), (year); (city, item), (city, year), (item, year); (city, item, year) at the base.]

82
Efficient Data Cube Computation

 Data cube can be viewed as a lattice of cuboids
 The bottom-most cuboid is the base cuboid
 The top-most cuboid (apex) contains only one cell
 How many cuboids are there in an n-dimensional cube where dimension i has Lᵢ levels (not counting the virtual level all)?

T = (L₁ + 1) × (L₂ + 1) × ... × (Lₙ + 1)

 Example: you have 2 dimensions:
 Time has 2 levels: Month and Year → L₁ = 2
 Product has 1 level: Category → L₂ = 1
 Now apply the formula:
 Total cuboids = (L₁ + 1) × (L₂ + 1) = (2 + 1) × (1 + 1) = 3 × 2 = 6
83
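A two-line Python check of the formula (the level counts are the slide's; the function name is mine):

from math import prod

def total_cuboids(levels):
    # T = product of (L_i + 1); the +1 is the 'all' (fully aggregated) level.
    return prod(l + 1 for l in levels)

print(total_cuboids([2, 1]))     # 6, as in the slide example
print(total_cuboids([5] * 10))   # 6^10 = 60,466,176: the curse of dimensionality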
Curse of Dimensionality
 If you have many dimensions, the number
of cuboids grows very fast:
 For n dimensions, 2ⁿ cuboids
 Example: If 10 dimensions, each with 5 levels
→ (5+1)^10=6¹⁰ = over 60 million cuboids!
 This causes huge storage and processing
issues, which is called the curse of
dimensionality.
 Solution: Materialization
 We don’t compute/store all cuboids. Instead,
we use Materialization strategies:
84
Types of Materialization of data
cube
 Full Materialization-Precompute and store all cuboids

Advantage: Very fast query response — no need to
calculate anything at query time

Disadvantage: Needs a lot of storage and time
 No Materialization: Don’t store any cuboids — compute
everything on the fly

Advantage: Saves space

Disadvantage: Slow query response
 Partial Materialization: (Best in practice)

Precompute and store only the most useful cuboids

Use when: You want a good balance between speed
and storage
85
Indexing OLAP Data: Bitmap
Index
 Bitmap indexing represents attribute values using bit vectors
(bitmaps) to allow fast querying and aggregation, especially in low-
cardinality domains (i.e., when an attribute has few distinct values).
 Index on a particular column Each value in the column has a bit
vector: bit-op is fast
 The length of the bit vector: No. of records in the base table
 a bit vector for each value of a single attribute.
 not suitable for high cardinality domains

Base table:
Cust Region  Type
C1   Asia    Retail
C2   Europe  Dealer
C3   Asia    Dealer
C4   America Retail
C5   Europe  Dealer

Index on Region:
RecID Asia Europe America
1     1    0      0
2     0    1      0
3     1    0      0
4     0    0      1
5     0    1      0

Index on Type:
RecID Retail Dealer
1     1      0
2     0      1
3     0      1
4     1      0
5     0      1

87
Reading the Region index as one bit vector per value:

Region  Bitmap
Asia    10100
Europe  01001
America 00010

88
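As a rough sketch of the idea (not any particular system's implementation), bitmaps can be modeled with Python integers, so a multi-attribute predicate becomes a single bitwise AND:

def build_bitmap_index(rows, column):
    # One bit vector per distinct value; bit i is set if row i has that value.
    index = {}
    for i, row in enumerate(rows):
        value = row[column]
        index[value] = index.get(value, 0) | (1 << i)
    return index

rows = [
    {"Cust": "C1", "Region": "Asia",    "Type": "Retail"},
    {"Cust": "C2", "Region": "Europe",  "Type": "Dealer"},
    {"Cust": "C3", "Region": "Asia",    "Type": "Dealer"},
    {"Cust": "C4", "Region": "America", "Type": "Retail"},
    {"Cust": "C5", "Region": "Europe",  "Type": "Dealer"},
]
region_idx = build_bitmap_index(rows, "Region")
type_idx = build_bitmap_index(rows, "Type")

# "Region = Europe AND Type = Dealer" is one bitwise AND over the bitmaps:
hits = region_idx["Europe"] & type_idx["Dealer"]
print([rows[i]["Cust"] for i in range(len(rows)) if hits >> i & 1])  # ['C2', 'C5']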
Indexing OLAP Data: Join Indices

 Join Indexing — Fast Join Between Tables
 Use Case: we want to precompute joins between a fact table and a dimension table for faster access.

Orders fact table:
OrderID CustomerID Amount
5001    101        $250
5002    102        $400
5003    101        $180

Customer dimension table:
CustomerID Name
101        Alice
102        Bob

Join index:
CustomerID OrderIDs
101        5001, 5003
102        5002

89
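A minimal Python sketch of the same precomputation (table and column names taken from the example above):

orders = [
    {"OrderID": 5001, "CustomerID": 101, "Amount": 250},
    {"OrderID": 5002, "CustomerID": 102, "Amount": 400},
    {"OrderID": 5003, "CustomerID": 101, "Amount": 180},
]

# Precompute the join: dimension key -> list of matching fact rows.
join_index = {}
for order in orders:
    join_index.setdefault(order["CustomerID"], []).append(order["OrderID"])

print(join_index)  # {101: [5001, 5003], 102: [5002]}
# At query time, "orders for customer 101" is a dictionary lookup,
# not a scan of the fact table.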
Efficient Processing OLAP Queries
 The purpose of materializing cuboids and constructing OLAP index
structures is to speed up query processing in data cubes
 Determine which operations should be performed on the
available cuboids
 Transform drill, roll, etc. into corresponding SQL and/or OLAP
operations, e.g., dice = selection + projection
 Determine which materialized cuboid(s) should be selected for
OLAP op.
 sales cube [time, item, location] : sum(sales in dollars)
 The cube tracks sales amounts (measure)
 Along three dimensions:
 Time (with hierarchy: day < month < quarter < year)
 Item (with hierarchy: item name < brand < type)
 Location (with hierarchy: street < city < province/state < country)
90
 The user wants to know: total sales by brand and province or state, where year = 2004
 Filters on year = 2004
 Groups by: brand and province or state
 Available materialized cuboids (precomputed):

Cuboid # Dimensions in the Cuboid
1        {year, item_name, city}
2        {year, brand, country}
3        {year, brand, province_or_state}
4        {item_name, province_or_state} with filter year = 2004

 Which cuboid(s) can be used? We need a cuboid that can satisfy grouping by brand, province_or_state and the selection year = 2004.
 Which should be selected to process the query?
 Explore indexing structures and compressed vs. dense array structures in MOLAP.

91
Cuboid # Dimensions in the Cuboid
1        {year, item_name, city}
2        {year, brand, country}
3        {year, brand, province_or_state}
4        {item_name, province_or_state} with filter year = 2004

 Find total sales by brand and province or state, where year = 2004
 C1: item_name is more detailed than brand (roll-up needed); city can roll up to province_or_state. Usable, but not optimal.
 C2: country is more general (higher level in the hierarchy); cannot drill down from country to province/state. Not usable.
 C3: exact match for what we need. Group by: brand, province/state; filter: year = 2004 (available). Best match, likely smallest size: ideal choice.
 C4: item_name (finer than brand, so can roll up); province/state (good); year = 2004 already selected (filter is applied). Might be efficient if well indexed. Medium choice.
 In real systems, we use cost-based estimation to choose the cuboid with:
 Lowest processing time
 Smallest I/O
 Efficient indexing

92
Why Can’t You Drill Down from country? (C2)

 Because of data loss during aggregation.
 Let’s say you have a cuboid:

Year Brand   Country Sum(Sales)
2004 Samsung USA     $500,000
2004 Samsung Canada  $200,000

 From this alone you cannot produce:

Year Brand   Province   Sum(Sales)
2004 Samsung California ???
2004 Samsung Texas      ???

93
Additional Strategies for Fast Query Responses

MOLAP achieves fast OLAP query responses by using optimized array structures, compression for sparsity, and intelligent indexing.
Online Aggregation (real-time approximate results)
 The system shows partial/estimated results while the full query is still processing.
 These results are:
 Refined over time as more data is processed.
 Accompanied by confidence intervals (estimates of accuracy).
 Benefit: users can get early feedback and adjust their queries without waiting.
Top-N Queries
 Instead of getting all records sorted by some metric (like sales), you ask only for the Top N results.
 E.g., “Show top 10 selling products.”
 This reduces processing time and improves interactivity.
 Ideal for dashboards or quick decision-making.

94
OLAP Server Architectures
 Relational OLAP (ROLAP)

Use relational or extended-relational DBMS to store and
manage warehouse data and OLAP middle ware

Include optimization of DBMS backend, implementation of
aggregation navigation logic, and additional tools and services

Greater scalability
 Multidimensional OLAP (MOLAP)

Sparse array-based multidimensional storage engine

Fast indexing to pre-computed summarized data
 Hybrid OLAP (HOLAP) (e.g., Microsoft SQLServer)

Flexibility, e.g., low level: relational, high-level: array
 Specialized SQL servers (e.g., Redbricks)

Specialized support for SQL queries over star/snowflake
schemas
95
Chapter 4: Data Warehousing and On-line
Analytical Processing

 Data Warehouse: Basic Concepts


 Data Warehouse Modeling: Data Cube and
OLAP
 Data Warehouse Design and Usage
 Data cube Technology
 Data Generalization by Attribute-Oriented
Induction
 Summary
96
Data Generalization and Data Cube Computation

 Data generalization is the process of summarizing and abstracting a large amount of task-relevant data in a database, moving from a detailed (low-level) to a more general (high-level) form.

97
Data Cube: A Lattice of Cuboids

0-D (apex) cuboid: all
1-D cuboids: time, item, location, supplier
2-D cuboids: time,item; time,location; time,supplier; item,location; item,supplier; location,supplier
3-D cuboids: time,item,location; time,item,supplier; time,location,supplier; item,location,supplier
4-D (base) cuboid: time, item, location, supplier

98
Data Cube: A Lattice of Cuboids

[The same lattice of cuboids, from the apex (all) down to the base cuboid (time, item, location, supplier).]

 Base vs. aggregate cells; ancestor vs. descendant cells; parent vs. child cells
1. (9/15, milk, Urbana, Dairy_land)
2. (9/15, milk, Urbana, *)
3. (*, milk, Urbana, *)
4. (*, milk, Chicago, *)
5. (*, milk, *, *)
99
Iceberg Cube, Closed Cube & Cube Shell
 Is iceberg cube good enough?
 2 base cells: {(a1, a2, a3 . . . , a100):10, (a1, a2, b3, . . . , b100):10}

How many cells will the iceberg cube have if having count(*)
>= 10? Hint: A huge but tricky number!
 Close cube:

Closed cell c: if there exists no cell d, s.t. d is a descendant of
c, and d has the same measure value as c.

Closed cube: a cube consisting of only closed cells

What is the closed cube of the above base cuboid? Hint: only
3 cells
 Cube Shell
 Precompute only the cuboids involving a small # of dimensions, e.g., 3
 For (A1, A2, ..., A10), how many combinations to compute?
 More dimension combinations will need to be computed on the fly
100
Efficient Methods for Data Cube
Computation
 A Road Map for the Materialization of Different
Kinds of Cubes
 Multiway Array Aggregation for Full Cube
Computation
 BUC: Computing Iceberg Cubes from the Apex
Cuboid Downward
 Star-Cubing: Computing Iceberg Cubes Using a
Dynamic Star-Tree Structure
 Precomputing Shell Fragments for Fast High-
Dimensional OLAP
 Computing Cubes with Complex Iceberg
Conditions
101
A Road Map for the Materialization
of Different Kinds of Cubes
 Iceberg Cubes
 Store only those cells above a threshold
(e.g., sales ≥ $100). Saves space by
ignoring low-value or empty data. Can be
defined in SQL using a HAVING clause.
 SELECT month, city, customer_group,
COUNT(*)
 FROM salesInfo
 CUBE BY month, city, customer_group
 HAVING COUNT(*) >= 10;
102
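For intuition, here is a pandas sketch of the same HAVING-style filter (the toy data and min_sup = 2 are mine): compute one cuboid's group-by counts, then keep only cells meeting minimum support:

import pandas as pd

salesInfo = pd.DataFrame({
    "month":          ["Jan", "Jan", "Jan", "Feb", "Feb"],
    "city":           ["Chicago", "Chicago", "Chicago", "Toronto", "Toronto"],
    "customer_group": ["Business", "Business", "Business", "Education", "Home"],
})

# One cuboid of the iceberg cube: count per (month, city, customer_group),
# keeping only cells that satisfy the iceberg condition count >= min_sup.
min_sup = 2
counts = salesInfo.groupby(["month", "city", "customer_group"]).size()
iceberg_cells = counts[counts >= min_sup]
print(iceberg_cells)   # only (Jan, Chicago, Business): 3 survives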
 Data cube computation is an essential task in DWH. Pre-
computation of all or part of a data cube can greatly reduce
the response time and increase performance of OLAP. This is
a challenging task.
 How can we compute data cubes in advance?
 Cube Materialization:
 (i) Base Cell:- A cell in base cuboid is base cell.
 (ii) Aggregate Cell:- A cell in non-base cuboid. Each
aggregated dimension indicated as '*'.
 Ex:- If we take 3D-cube all 1-D, 2-D cells are aggregate cells
and 3-D is Base cell.
 (iii) Ancestor & Descendent cells:- 1-D and 2-D cells are
ancestors of 3-D cell. 3-D is descendent cell.

103
 Full cube:- Computation all the cells of all the cuboids
in data cube.
 Iceberg cube:- Computing only the cuboid cells whose
measure satisfies the iceberg condition. EX:- Only a
small portion of cells may be "above the water" in a
sparse cube

compute cube sales_iceberg as
select month, city, customer_group, count(*)
from salesInfo
cube by month, city, customer_group
having count(*) >= min_support
 Closed Cube: a cube consisting of only closed cells. A cell that has no descendant cell with the same measure value is a closed cell.

104
General Strategies for Data
Cube Computation
 1. Sorting, Hashing, and Grouping
 Purpose: Cluster and organize data to
facilitate aggregation.
 Sort or hash dimension attributes to group
tuples (rows) with the same values.
 This helps aggregate values efficiently
(e.g., total sales by branch/day/item).
 Shared-sorts and shared-partitions reduce
redundant sorting and partitioning.

105
 2. Simultaneous Aggregation and
Caching Intermediate Results
 Purpose: Avoid redundant reads and
recalculations.
 Use lower-level aggregates to compute
higher-level ones.
 Cache intermediate results to reduce disk
I/O.
 Techniques like amortized scans compute
multiple cuboids in a single pass.
106
 3. Aggregation from the Smallest Child
 Purpose: Use the least costly child
cuboid for parent computation.
 If multiple child cuboids exist, choose the
one with fewer distinct values.
 Example: Computing sales by branch from
(branch, year) is faster than from (branch,
item) if there are fewer years than items.

107
 4. Apriori Pruning (for Iceberg Cubes)
 Purpose: Eliminate unimportant or low-
support cells early.
 Apriori property: If a cell doesn’t meet
the threshold (e.g., count < 10), its
descendants won’t either.
 Saves time and space by avoiding
unnecessary computation.
 Works well with antimonotonic
measures (like count or sum).
108
 Iceberg cubes materialize only significant
cells, where values meet a minimum
support threshold.
 Apriori pruning reduces exploration of
irrelevant aggregate cells.

110
Computing Full/Iceberg Cubes: 3 Methodologies

Methodology  Type       Strategy               Best For                    Key Technique
Multi-Way    Bottom-Up  Full scan, partitioned Full cubes                  Array-based aggregation
BUC          Top-Down   Recursive pruning      Iceberg cubes, skewed data  Apriori-style pruning
Star-Cubing  Hybrid     Bottom-up + top-down   Iceberg cubes, large data   Star-tree, shared dims

111
Method                 Description
MultiWay               Full cube computation using arrays
BUC                    Iceberg cubes from the top (apex) downward
Star-Cubing            Combines top-down and bottom-up
Shell Fragments        Efficient computation for high-dimensional cubes
Complex Iceberg Cubes  Deals with complex measures like averages

112
Data Cube Technology

 Data Cube Computation Methods

 Multi-Way Array Aggregation

 BUC

 Star-Cubing

 High-Dimensional OLAP

113
Multi-Way Array Aggregation

 Array-based “bottom-up” algorithm
 Using multi-dimensional chunks
 No direct tuple comparisons
 Simultaneous aggregation on multiple dimensions
 Intermediate aggregate values are re-used for computing ancestor cuboids
 Cannot do Apriori pruning: no iceberg optimization

[Figure: lattice from ABC up through AB, AC, BC to A, B, C and All.]

114
Multi-way Array Aggregation for Cube Computation (MOLAP)

 Partition arrays into chunks (a small subcube which fits in memory).
 Compressed sparse array addressing: (chunk_id, offset)
 Compute aggregates in “multiway” by visiting cube cells in the order which minimizes the # of times to visit each cell, and reduces memory access and storage cost.

[Figure: a 4 × 4 × 4 array over dimensions A (a0-a3), B (b0-b3), C (c0-c3), partitioned into 64 chunks numbered 1-64.]

What is the best traversing order to do multi-way aggregation?
115
 4*4*4=64 Chunks
What is the best traversing order to do multi-way aggregation?
 Suppose the cardinality of A,B,C is 40,400,4000 respectively.
Thus the size of each partition is 10, 100, 1000 respectively.
 Computation of all cuboids means ABC, AB, AC, BC, A, B, C,

Apex.
 Possible orderings are -

(i) Ordering from 1,2, 3, 64. By scanning chunks of 1 to 4 of


ABC, computed (aggregated over a0 to a3). Multiway
aggregation used to avoid rescan of particular chunks. To
avoid 3-D chunk more than once into memory, the minimum
requirement for holding all relevant 2-D planes,
 whole AB plane + one row of AC + one chunk of BC plane

 i.e 40*400+40*1000 + 100 *1000 = 1,56,000 memory units

Data Mining: Concepts and


August 18, 2025 Techniques 116
 (ii) Ordering 1, 17, 33, 49, 5, 21, 37, 53, ...:
 whole BC plane + one row of the AC plane + one chunk of the AB plane
 i.e., 400 × 4000 + 40 × 1000 + 10 × 100 = 1,641,000 memory units.
 Therefore the most efficient order is the 1, 2, ..., 64 chunk order.

117
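A small script reproducing the two memory estimates (the cardinalities and chunk sizes are the slide's numbers):

# Cardinalities of A, B, C and the chunk size along each dimension.
card = {"A": 40, "B": 400, "C": 4000}
chunk = {"A": 10, "B": 100, "C": 1000}

# Ordering (i): whole AB plane + one row of the AC plane + one chunk of the BC plane.
mem_i = card["A"] * card["B"] + card["A"] * chunk["C"] + chunk["B"] * chunk["C"]

# Ordering (ii): whole BC plane + one row of the AC plane + one chunk of the AB plane.
mem_ii = card["B"] * card["C"] + card["A"] * chunk["C"] + chunk["A"] * chunk["B"]

print(mem_i, mem_ii)  # 156000 vs 1641000: sweep the smallest planes first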
Multi-way Array Aggregation for Cube Computation (3-D to 2-D)

[Figure: lattice showing how the 2-D cuboids AB, AC, BC are computed from ABC.]

 The best order is the one that minimizes the memory requirement and reduces I/Os.

118
Multi-way Array Aggregation for Cube Computation (2-D to 1-D)

[Figure: lattice showing how the 1-D cuboids A, B, C are computed from AB, AC, BC.]

119
Bottom-Up Computation (BUC)

 BUC is an algorithm (“bottom-up construction”) for the computation of sparse and iceberg cubes.
 Unlike Multiway, BUC constructs the cube from the apex cuboid toward the base cuboid, which allows BUC to share data partitioning costs.
 This processing order also allows BUC to prune during construction, using the Apriori property.
 The best order is the one that minimizes the memory requirement and reduces I/Os.
 No simultaneous aggregation.

[Figure: the 4-D cuboid lattice (all; A, B, C, D; AB ... CD; ABC ... BCD; ABCD) annotated with BUC's processing order: 1 all; 2 A; 3 AB; 4 ABC; 5 ABCD; 6 ABD; 7 AC; 8 ACD; 9 AD; 10 B; 11 BC; 12 BCD; 13 BD; 14 C; 15 CD; 16 D.]

120
[BUC algorithm pseudocode figure not captured in this export.]

121
Algorithm Explanation

 Initially, the algorithm is called with the input relation (set of tuples). BUC aggregates the entire input (line 1) and writes the resulting total (line 3). (Line 2 is an optimization feature that is discussed later in our example.)
 For each dimension d (line 4), the input is partitioned on d (line 6). On return from Partition(), dataCount contains the total number of tuples for each distinct value of dimension d. Each distinct value of d forms its own partition.
 Line 8 iterates through each partition. Line 10 tests the partition for minimum support. That is, if the number of tuples in the partition satisfies (i.e., is at least) the minimum support, then the partition becomes the input relation for a recursive call made to BUC, which computes the iceberg cube on the partitions for dimensions d + 1 to numDims (line 12).

122
 Note that for a full cube (i.e., where minimum support in the having clause is 1), the minimum support condition is always satisfied. Thus, the recursive call descends one level deeper into the lattice. On return from the recursive call, we continue with the next partition for d. After all the partitions have been processed, the entire process is repeated for each of the remaining dimensions.

123
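Since the BUC pseudocode figure did not survive this export, here is a compact Python sketch of the recursion under the simplifying assumption that the measure is count(*); function and variable names are mine, not the textbook's:

def buc(tuples, dims, min_sup, prefix=(), out=None):
    # tuples: list of tuples (one value per dimension)
    # dims:   indices of dimensions still to expand
    # prefix: (dim, value) pairs fixed so far -- identifies one output cell
    if out is None:
        out = {}
    out[prefix] = len(tuples)        # aggregate the entire input first
    for pos, d in enumerate(dims):
        # Partition the input on dimension d.
        parts = {}
        for t in tuples:
            parts.setdefault(t[d], []).append(t)
        for value, part in parts.items():
            # Apriori pruning: a partition below minimum support is skipped;
            # none of its descendants can satisfy the condition either.
            if len(part) >= min_sup:
                buc(part, dims[pos + 1:], min_sup,
                    prefix + ((d, value),), out)
    return out

data = [("a1", "b1", "c1"), ("a1", "b1", "c4"), ("a1", "b2", "c2"),
        ("a2", "b3", "c3"), ("a2", "b4", "c3")]
cells = buc(data, dims=[0, 1, 2], min_sup=2)
print(cells)   # e.g. {(): 5, ((0, 'a1'),): 3, ((0, 'a1'), (1, 'b1')): 2, ...}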
Star-Cubing: An Integrating Method

 Computing iceberg cubes by top-down + bottom-up integration.
 Explore shared dimensions
 E.g., dimension A is the shared dimension of ACD and AD
 ABD/AB means cuboid ABD has shared dimensions AB
 Allows for shared computations
 e.g., cuboid AB is computed simultaneously as ABD
 Aggregate in a top-down manner but with the bottom-up sub-layer underneath, which will allow Apriori pruning
 Shared dimensions grow in bottom-up fashion

[Figure: cuboid tree rooted at ABCD/all, with cuboids annotated by their shared dimensions: ABC/ABC, ABD/AB, ACD/A, BCD; AC/AC, AD/A, BC/BC, BD/B, CD; C/C, D.]

124
 Star-Cubing prunes (skips) unnecessary computations during iceberg cube computation by recognizing shared dimensions in a subtree of cuboids.
 For example, cuboid AB extending from ABD in Figure 4.8 would actually be pruned because AB was already computed in ABD/AB. Similarly, cuboid A extending from AD would also be pruned because it was already computed in ACD/A.

125
 To see how the Star-Cubing algorithm works, we need a few more concepts, namely, cuboid trees, star-nodes, and star-trees.
 Each level in the tree represents a dimension, and each node represents an attribute value. Each node has four fields: the attribute value, aggregate value, pointer(s) to possible descendant(s), and pointer to possible sibling.

[A fragment of the base cuboid tree: figure not captured in this export.]

126
 A node p is a star-node if the single-dimensional aggregate on p does not satisfy the iceberg condition.
 Otherwise, p is a non-star-node.
 A cuboid tree that is compressed using star-nodes is called a star-tree.

127
 Base Cuboid Table (Table 4.1): contains 5 tuples and 4 dimensions: A, B, C, D
 Cardinalities: A → 2 values; B, C, D → 4 values each
 Iceberg condition: minimum support (minsup) = 2

A   B   C   D   count
a1  b1  c1  d1  1
a1  b1  c4  d3  1
a1  b2  c2  d2  1
a2  b3  c3  d4  1
a2  b4  c3  d4  1

128

[Star-table reduction and star-tree construction figures for this example not captured in this export.]

129

130

131
Chapter 4: Data Warehousing and On-line
Analytical Processing

 Data Warehouse: Basic Concepts


 Data Warehouse Modeling: Data Cube and
OLAP
 Data Warehouse Design and Usage
 Data cube Technology
 Data Generalization by Attribute-Oriented
Induction
 Summary
132
Attribute-Oriented Induction
 Proposed in 1989 (KDD ‘89 workshop)
 Not confined to categorical data nor particular
measures
 How it is done?
 Collect the task-relevant data (initial relation) using
a relational database query
 Perform generalization by attribute removal or
attribute generalization
 Apply aggregation by merging identical, generalized
tuples and accumulating their respective counts
 Interaction with users for knowledge presentation

133
Attribute-Oriented Induction: An
Example
Example: Describe general characteristics of graduate
students in the University database
 Step 1. Fetch relevant set of data using an SQL
statement, e.g.,
Select * (i.e., name, gender, major, birth_place,
birth_date, residence, phone#, gpa)
from student
where student_status in {“Msc”, “MBA”, “PhD” }
 Step 2. Perform attribute-oriented induction
 Step 3. Present results in generalized relation, cross-tab,
or rule forms

134
Class Characterization: An Example

Name Gender Major Birth-Place Birth_date Residence Phone # GPA

Initial Jim M CS Vancouver,BC, 8-12-76 3511 Main St., 687-4598 3.67


Woodman Canada Richmond
Relation Scott M CS Montreal, Que, 28-7-75 345 1st Ave., 253-9106 3.70
Lachance Canada Richmond
Laura Lee F Physics Seattle, WA, USA 25-8-70 125 Austin Ave., 420-5232 3.83
… … … … … Burnaby … …

Removed Retained Sci,Eng, Country Age range City Removed Excl,
Bus VG,..
Gender Major Birth_region Age_range Residence GPA Count
Prime M Science Canada 20-25 Richmond Very-good 16
Generalized F Science Foreign 25-30 Burnaby Excellent 22
Relation … … … … … … …

Birth_Region
Canada Foreign Total
Gender
M 16 14 30
F 10 22 32
Total 26 36 62

135
Basic Principles of Attribute-Oriented
Induction
 Data focusing: task-relevant data, including dimensions,
and the result is the initial relation
 Attribute-removal: remove attribute A if there is a large
set of distinct values for A but (1) there is no
generalization operator on A, or (2) A’s higher level
concepts are expressed in terms of other attributes
 Attribute-generalization: If there is a large set of distinct
values for A, and there exists a set of generalization
operators on A, then select an operator and generalize A
 Attribute-threshold control: typical 2-8, specified/default
 Generalized relation threshold control: control the final
relation/rule size

136
Attribute-Oriented Induction: Basic
Algorithm
 InitialRel: Query processing of task-relevant data,
deriving the initial relation.
 PreGen: Based on the analysis of the number of
distinct values in each attribute, determine
generalization plan for each attribute: removal? or
how high to generalize?
 PrimeGen: Based on the PreGen plan, perform
generalization to the right level to derive a “prime
generalized relation”, accumulating the counts.
 Presentation: User interaction: (1) adjust levels by
drilling, (2) pivoting, (3) mapping into rules, cross
tabs, visualization presentations.
137
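A minimal pandas sketch of the PreGen/PrimeGen steps (the concept-hierarchy dictionaries below are illustrative assumptions, not the textbook's): generalize each retained attribute, then merge identical generalized tuples while accumulating counts:

import pandas as pd

# Initial relation (task-relevant data), abbreviated.
students = pd.DataFrame({
    "gender":      ["M", "M", "F"],
    "major":       ["CS", "CS", "Physics"],
    "birth_place": ["Canada", "Canada", "USA"],
    "gpa":         [3.67, 3.70, 3.83],
})

# Assumed generalization operators (concept hierarchies).
major_to_field = {"CS": "Science", "Physics": "Science", "Music": "Arts"}
place_to_region = {"Canada": "Canada"}          # everything else -> "Foreign"

generalized = pd.DataFrame({
    "gender":       students["gender"],                          # retained as-is
    "major":        students["major"].map(major_to_field),       # attribute generalization
    "birth_region": students["birth_place"].map(place_to_region).fillna("Foreign"),
    "gpa":          pd.cut(students["gpa"], [0, 3.75, 4.0],
                           labels=["Very-good", "Excellent"]),   # numeric -> ranges
})
# Attributes like name and phone# would be removed here (attribute removal):
# too many distinct values and no generalization operator.

# Merge identical generalized tuples, accumulating counts (prime relation).
prime = (generalized.groupby(list(generalized.columns), observed=True)
                    .size().reset_index(name="count"))
print(prime)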
Presentation of Generalized
Results
 Generalized relation:
 Relations where some or all attributes are generalized, with
counts or other aggregation values accumulated.
 Cross tabulation:
 Mapping results into cross tabulation form (similar to
contingency tables).
 Visualization techniques:
 Pie charts, bar charts, curves, cubes, and other visual
forms.
 Quantitative characteristic rules:
 Mapping generalized result into characteristic rules with quantitative information associated with it, e.g.,
grad(x) ∧ male(x) ⇒ birth_region(x) = "Canada" [t: 53%] ∨ birth_region(x) = "foreign" [t: 47%]
138
Mining Class Comparisons

 Comparison: Comparing two or more classes


 Method:
 Partition the set of relevant data into the target class and
the contrasting class(es)
 Generalize both classes to the same high level concepts
 Compare tuples with the same high level descriptions
 Present for every tuple its description and two measures
 support - distribution within single class
 comparison - distribution between classes
 Highlight the tuples with strong discriminant features
 Relevance Analysis:
 Find attributes (features) which best distinguish different
classes
139
Chapter 4: Data Warehousing and On-line
Analytical Processing

 Data Warehouse: Basic Concepts


 Data Warehouse Modeling: Data Cube and
OLAP
 Data Warehouse Design and Usage
 Data Warehouse Implementation
 Data Generalization by Attribute-Oriented
Induction
 Summary
140
Summary
 Data warehousing: A multi-dimensional model of a data warehouse
 A data cube consists of dimensions & measures
 Star schema, snowflake schema, fact constellations
 OLAP operations: drilling, rolling, slicing, dicing and pivoting
 Data Warehouse Architecture, Design, and Usage
 Multi-tiered architecture
 Business analysis design framework
 Information processing, analytical processing, data mining,
OLAM (Online Analytical Mining)
 Implementation: Efficient computation of data cubes
 Partial vs. full vs. no materialization
 Indexing OLAP data: Bitmap index and join index
 OLAP query processing
 OLAP servers: ROLAP, MOLAP, HOLAP
 Data generalization: Attribute-oriented induction
141
