Data Warehouse Basics & Models
Definition:
“A data warehouse is a subject-oriented, integrated, time-variant, and
nonvolatile collection of data used in support of the decision-making
process.” — W. H. Inmon
G.Karuna, GRIET
April 14, 2022 1
Data Warehouse - Introduction
Subject-oriented:-
Organized around major subjects, such as customer, product, sales, and item.
Focuses on the modeling and analysis of data for decision makers,
not on daily operations or transaction processing.
Provides a simple and concise view around a particular subject.
Integrated:-
Constructed by integrating multiple, heterogeneous data sources, such as
relational databases, flat files, and on-line transaction records, put into
a consistent format.
Time-variant:-
Data are stored to provide information from a historical perspective.
Implicitly or explicitly, the key structure in the DWH contains an element
of time.
Non-volatile:-
The DWH is always a physically separate store of data, transformed from
the application data found in the operational environment.
Operational updates of data do not occur in the data warehouse
environment; data can only be extracted, not modified.
Data Warehouse vs. Operational DBMS
OLTP systems:-
Major task: to perform online transaction and query processing.
Cover day-to-day operations: purchasing, inventory, banking,
manufacturing, payroll, registration, accounting, etc.
OLAP systems:-
Major task: to perform data analysis and support decision making for
knowledge workers.
Data are organized in different formats.
Distinct features (OLTP vs. OLAP):-
User and system orientation: customer vs. market
Data contents: current, detailed vs. historical, consolidated
Database design: ER + application vs. star + subject
Access patterns: update vs. read-only but complex queries
OLTP vs. OLAP
S.No.  Feature         OLTP                                     OLAP
1      Definition      Online Transaction Processing            Online Analytical Processing
2      Characteristic  Operational processing                   Informational processing
3      Orientation     Transaction                              Analysis
4      Users           DBA, DB professional                     Knowledge workers (managers)
5      Function        Day-to-day operations                    Historical / decision support
6      DB design       ER-based design, application-oriented    Star/snowflake schemas, subject-oriented
7      Data            Dynamic, current, relational,            Static, historical, summarized,
                       detailed, up to date                     consolidated, multidimensional
8      View            Detailed view                            Summarized view
9      Unit of work    Short & simple transactions              Complex queries
10     Access          Read / write                             Mostly read
11     No. of records  Tens/hundreds                            Millions
12     Users           Thousands                                Hundreds
13     DB size         100 MB to GB                             100 GB to TB
14     Priority        High performance                         High flexibility
15     Metrics         Transaction throughput                   Query throughput
16     Focus           Data in                                  Information out
17     Operations      Indexing, hashing                        Lots of scans for analysis
Why Separate Data Warehouse?
To promote high performance for both kinds of system:
DBMS — tuned for OLTP: access methods, indexing, searching, concurrency
control, recovery.
Warehouse — tuned for OLAP: complex OLAP queries, computations,
multidimensional views, consolidation.
Running OLAP workloads on an operational DB degrades its performance.
OLTP supports concurrency control and recovery; applying these to a DWH
would reduce its throughput.
DSS require historical data, which operational DBs do not maintain.
DSS require consolidation of data from heterogeneous sources, but
operational DBs contain only raw data.
The two systems support quite different functionalities; thus we need a
separate DWH.
Many vendors are trying to optimize OLTP databases so that they can also
support OLAP in the future.
From Tables and Spreadsheets to Data Cubes
A data warehouse and OLAP tools are based on a multidimensional data
model, which views data in the form of a data cube.
A data cube allows data to be modeled and viewed in multiple
dimensions; i.e., a data cube is defined by dimensions and facts.
Dimension tables describe the dimensions, such as item (item_name,
brand, type) or time (day, week, month, quarter, year).
The fact table contains measures (such as dollars_sold) and keys to
each of the related dimension tables.
The physical structure of a DWH is a data cube.
A data cube provides a multidimensional view and allows pre-computation
and fast access of consolidated data.
From Tables and Spreadsheets to Data Cubes
Ex:- The AllElectronics company may create a DWH to keep records of sales
with respect to 4 dimensions: item, time, supplier, and location.
Each dimension may be associated with a table — a dimension table.
Ex:- item is a dimension (item_name, type, ….)
A data cube is organized around a central theme — the fact table.
Ex:- sales (dollars_sold, units_sold)
In a DWH, the data cube is n-dimensional and is a combination of
dimension and fact tables.
Before the full multidimensional model, we start with a 2-D cube.
From Tables and Spreadsheets to Data Cubes
In the data warehousing literature, a data cube is also referred to as a
cuboid.
Given a set of dimensions, we can generate a cuboid for each of the
possible subsets. The result forms a lattice of cuboids, which shows the
data at different levels of summarization (or group-by).
The resulting lattice of cuboids is called the n-dimensional data cube.
The cuboid that holds the lowest level of summarization is called the base
cuboid. The topmost, 0-D cuboid, which holds the highest level of
summarization, is called the apex cuboid. The lattice of cuboids forms
the data cube.
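The lattice can be generated mechanically: the cuboids of an n-dimensional cube are exactly the 2^n subsets of its dimensions. A minimal Python sketch:

```python
from itertools import combinations

def lattice_of_cuboids(dimensions):
    """Generate every cuboid (group-by subset) of the given dimensions.

    The empty subset is the 0-D apex cuboid ("all"); the full set of
    dimensions is the n-D base cuboid.
    """
    cuboids = []
    for k in range(len(dimensions) + 1):
        for subset in combinations(dimensions, k):
            cuboids.append(subset)
    return cuboids

dims = ("time", "item", "location", "supplier")
lattice = lattice_of_cuboids(dims)
print(len(lattice))   # 2^4 = 16 cuboids
print(lattice[0])     # () — the apex cuboid
print(lattice[-1])    # ('time', 'item', 'location', 'supplier') — the base cuboid
```

For 4 dimensions this yields 16 cuboids: one 0-D apex, four 1-D, six 2-D, four 3-D, and one 4-D base cuboid.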
Cube: A Lattice of Cuboids
(Figure: lattice of cuboids for the dimensions time, item, location, and
supplier — from the 0-D apex cuboid "all" at the top, through the 1-D and
2-D cuboids, to the 3-D cuboids (time, item, location), (time, item,
supplier), (time, location, supplier), and (item, location, supplier),
down to the 4-D base cuboid.)
Conceptual Modeling of Data Warehouses
Modeling data warehouses: dimensions & measures
Star schema:
The most popular and commonly used model.
The DWH contains a large central table, holding the bulk of the
data with no redundancy, called the fact table.
A set of smaller attendant tables, one for each dimension,
called dimension tables (de-normalized tables).
The schema graph looks like a starburst, with the fact table at
the centre surrounded by the dimension tables.
The fact table contains the key attributes of the dimensions.
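The star schema can be sketched directly in SQL. Below is a minimal, hypothetical version of the sales example (using Python's built-in SQLite), with the fact table holding the dimension keys plus the measures; the sample row and values are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time     (time_key INTEGER PRIMARY KEY, day TEXT, day_of_the_week TEXT,
                       month TEXT, quarter TEXT, year INTEGER);
CREATE TABLE item     (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT,
                       type TEXT, supplier_type TEXT);
CREATE TABLE branch   (branch_key INTEGER PRIMARY KEY, branch_name TEXT, branch_type TEXT);
CREATE TABLE location (location_key INTEGER PRIMARY KEY, street TEXT, city TEXT,
                       state_or_province TEXT, country TEXT);
-- Central fact table: keys to each dimension plus the measures.
CREATE TABLE sales_fact (
    time_key     INTEGER REFERENCES time(time_key),
    item_key     INTEGER REFERENCES item(item_key),
    branch_key   INTEGER REFERENCES branch(branch_key),
    location_key INTEGER REFERENCES location(location_key),
    units_sold   INTEGER,
    dollars_sold REAL
);
""")
conn.execute("INSERT INTO time VALUES (1,'Mon','Monday','Jan','Q1',2022)")
conn.execute("INSERT INTO item VALUES (1,'LCD TV','BrandX','electronics','wholesale')")
conn.execute("INSERT INTO branch VALUES (1,'B1','retail')")
conn.execute("INSERT INTO location VALUES (1,'Main St','Hyderabad','Telangana','India')")
conn.execute("INSERT INTO sales_fact VALUES (1,1,1,1,5,1500.0)")

# A typical OLAP-style query: total dollars sold per quarter and city.
row = conn.execute("""
    SELECT t.quarter, l.city, SUM(f.dollars_sold)
    FROM sales_fact f
    JOIN time t     ON f.time_key = t.time_key
    JOIN location l ON f.location_key = l.location_key
    GROUP BY t.quarter, l.city
""").fetchone()
print(row)  # ('Q1', 'Hyderabad', 1500.0)
```

Note how every analytical query follows the same pattern: join the fact table to the relevant dimensions, then group and aggregate the measures.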
Example of Star Schema
Sales fact table: time_key, item_key, branch_key, location_key;
measures: units_sold, dollars_sold, avg_sales
Dimension tables:
time (time_key, day, day_of_the_week, month, quarter, year)
item (item_key, item_name, brand, type, supplier_type)
branch (branch_key, branch_name, branch_type)
location (location_key, street, city, state_or_province, country)
Conceptual Modeling of Data Warehouses
Snowflake schema:
A refinement of the star schema.
Here some dimension tables are further normalized into a set of
smaller dimension tables, i.e., the data is split into additional
tables.
A single fact table with multiple normalized dimension tables
forms a shape similar to a snowflake.
To reduce redundancy, this model is kept in normalized form.
Example of Snowflake Schema
Sales fact table: time_key, item_key, branch_key, location_key;
measures: units_sold, dollars_sold, avg_sales
Dimension tables:
time (time_key, day, day_of_the_week, month, quarter, year)
item (item_key, item_name, brand, type, supplier_key) → supplier (supplier_key, supplier_type)
branch (branch_key, branch_name, branch_type)
location (location_key, street, city_key) → city (city_key, city, state_or_province, country)
Conceptual Modeling of Data Warehouses
Fact constellation schema:
Sophisticated applications may require multiple fact
tables to share multiple dimension tables.
Multiple fact tables share multiple dimension tables; viewed as a
collection of stars, this is therefore called a galaxy schema or
fact constellation.
Example of Fact Constellation
(Figure: fact constellation — the sales fact table (time_key, item_key,
branch_key, location_key; measures units_sold, dollars_sold, avg_sales)
and the shipping fact table (time_key, item_key, shipper_key,
from_location, to_location; measures dollar_cost, units_shipped) share
the time and item dimension tables.)
Defining Star Schema in DMQL
define cube sales [time, item, branch, location]:
dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars),
units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier_type)
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city, province_or_state,
country)
Defining Fact Constellation in DMQL
define cube sales [time, item, branch, location]:
dollars_sold = sum(sales_in_dollars), avg_sales =
avg(sales_in_dollars), units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier_type)
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city, province_or_state,
country)
define cube shipping [time, item, shipper, from_location, to_location]:
dollar_cost = sum(cost_in_dollars), unit_shipped = count(*)
define dimension time as time in cube sales
define dimension item as item in cube sales
define dimension shipper as (shipper_key, shipper_name, location as location
in cube sales, shipper_type)
define dimension from_location as location in cube sales
define dimension to_location as location in cube sales
Measures of Data Cube: Three Categories
Data cube measures are categorized by the kind of aggregate function used:
(i) Distributive: if the result derived by applying the function
to n aggregate values is the same as that derived by
applying the function on all the data without partitioning
E.g., count(), sum(), min(), max()
(ii) Algebraic: if it can be computed by an algebraic function
with M arguments (where M is a bounded integer), each of
which is obtained by applying a distributive aggregate
function
E.g., avg()
(iii) Holistic: if there is no constant bound on the storage size
needed to describe a subaggregate.
E.g., median(), mode(), rank()
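The distinction between distributive and algebraic measures can be shown on a few numbers (the data values here are arbitrary):

```python
# Sketch: why sum() is distributive and avg() is algebraic.
data = [4, 8, 15, 16, 23, 42]
partitions = [data[:3], data[3:]]          # any partitioning of the data

# Distributive: applying sum() to the per-partition sums gives the same
# result as applying sum() to all the data without partitioning.
assert sum(sum(p) for p in partitions) == sum(data)

# Algebraic: avg() is not distributive (an average of averages is wrong
# in general), but it can be computed from M = 2 distributive
# sub-aggregates kept per partition: sum() and count().
pairs = [(sum(p), len(p)) for p in partitions]
total, n = map(sum, zip(*pairs))
assert total / n == sum(data) / len(data)

# Holistic: median() has no such bounded summary — in general all the
# values of each partition may be needed to combine sub-aggregates.
print(total, n, total / n)  # 108 6 18.0
```

This is exactly why distributive and algebraic measures are cheap to roll up from lower-level cuboids, while holistic measures are not.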
OLAP Operations in Data Cube
OLAP operations provide a user-friendly environment for
interactive analysis of the data in a data cube.
A number of OLAP operations exist to provide the flexibility to
view the data from different perspectives.
Ex:- Consider AllElectronics sales. The data cube contains the
item, time, and location dimensions. Time is aggregated to
quarters and location is aggregated to city values. The
measure used is dollars_sold.
OLAP Operations
1. Slice:- performs a selection on one dimension of the given data cube,
resulting in a sub-cube.
Ex:- slice for time = “Q1”
2. Dice:- performs a selection on two or more dimensions of the given data
cube, resulting in a sub-cube.
Ex:- dice for (location = “Hyd” or “Bang”) and (time = “Q1” or “Q2”)
and (item = “LCD” or “comp”)
3. Roll-up (drill-up): performs aggregation on the data cube, either by
climbing up a concept hierarchy for a dimension or by dimension reduction.
Ex:- roll-up on location (from cities to countries)
4. Drill-down (roll-down): the reverse of roll-up; performed either by
stepping down a concept hierarchy for a dimension or by introducing
additional dimensions.
Ex:- drill-down on time (from quarters to months)
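Slice, dice, and roll-up can be sketched on a tiny cube stored as a dict mapping (time, location, item) to dollars_sold. The cell values and the city-to-country mapping are hypothetical.

```python
cube = {
    ("Q1", "Hyd",  "LCD"):  100, ("Q1", "Hyd",  "comp"): 150,
    ("Q1", "Bang", "LCD"):  120, ("Q2", "Hyd",  "LCD"):   90,
    ("Q2", "Bang", "comp"): 200, ("Q3", "Delhi", "LCD"):  80,
}

# Slice: select on one dimension (time = "Q1").
slice_q1 = {k: v for k, v in cube.items() if k[0] == "Q1"}

# Dice: select on two or more dimensions.
dice = {k: v for k, v in cube.items()
        if k[0] in ("Q1", "Q2") and k[1] in ("Hyd", "Bang")}

# Roll-up on location: climb the concept hierarchy from city to country,
# summing the measure over the cells that merge.
country = {"Hyd": "India", "Bang": "India", "Delhi": "India"}
rollup = {}
for (t, city, item), v in cube.items():
    key = (t, country[city], item)
    rollup[key] = rollup.get(key, 0) + v

print(len(slice_q1), len(dice))        # 3 5
print(rollup[("Q1", "India", "LCD")])  # 100 + 120 = 220
```

Drill-down is simply the inverse direction: it needs the finer-grained data (months instead of quarters), so it must be answered from a lower-level cuboid, not derived from the rolled-up one.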
OLAP Operations
5. Pivot (rotate):- a visualization operation that rotates the data axes in
view in order to provide an alternative presentation of the data.
Ex:- pivot on item and location (these axes are rotated)
6. Drill-through: drills through the bottom level of the cube to its
back-end relational tables.
Data Warehouse Back-End Tools and Utilities
Data extraction
get data from multiple, heterogeneous, and external
sources
Data cleaning
detect errors in the data and rectify them when possible
Data transformation
convert data from legacy or host format to warehouse
format
Load
sort, summarize, consolidate, compute views, check
warehouse
Metadata Repository
Meta data is the data defining warehouse objects. It stores:
Description of the structure of the data warehouse
schema, views, dimensions, hierarchies, derived data definitions, data mart
locations and contents
Operational meta-data
data lineage (history of migrated data and transformation path),
currency of data (active, archived, or purged), monitoring
information (warehouse usage statistics, error reports, audit trails)
The algorithms used for summarization
The mapping from operational environment to the data warehouse
Data related to system performance
warehouse schema, view and derived data definitions
Business data
business terms and definitions, ownership of data, charging policies
Three tier DWH Architecture
A Recommended Approach for DWH Development
Top-down: provides a systematic solution and minimizes integration
problems, but it is expensive and lacks flexibility.
Bottom-up: provides flexibility, low cost, and rapid development, but
integration is difficult.
In the recommended approach:
(i) A high-level corporate data model is defined (within a short period)
that provides a corporate-wide, consistent, and integrated view of data
among the various subjects. It is refined further during development of
the enterprise DWH and data marts.
(ii) Data marts can be implemented in parallel with the DWH, based on the
same corporate model set.
(iii) Distributed data marts can be constructed to integrate the
different data marts.
(iv) A multi-tier DWH is constructed, in which the enterprise warehouse
is the sole custodian of all warehouse data, which is then distributed to
the various dependent data marts.
Data Warehouse Development:
A Recommended Approach
(Figure: a multi-tier data warehouse built from distributed data marts.)
The join index records can thus identify joinable tuples. Join indexing
is especially useful for maintaining the relationship between a foreign
key and its matching primary keys in the joinable relation.
In a DWH, join indexing is useful for cross-table search because of the
star schema model of the DWH.
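A join index can be sketched as a mapping from each dimension key to the row positions of the joinable fact tuples, so that the fact table never has to be scanned. The fact rows and key values below are hypothetical.

```python
# Fact table as a list; the row id (rid) is the list position.
# Each tuple: (location_key, dollars_sold)
fact = [
    ("L1", 100.0), ("L2", 250.0), ("L1", 75.0), ("L3", 300.0), ("L1", 50.0),
]

# Build the join index: dimension key -> rids of joinable fact tuples.
join_index = {}
for rid, (loc_key, _) in enumerate(fact):
    join_index.setdefault(loc_key, []).append(rid)

# All fact tuples joinable with location "L1", found without a scan:
rids = join_index["L1"]
print(rids)                           # [0, 2, 4]
print(sum(fact[r][1] for r in rids))  # 225.0
```

A real system would store such indices persistently (e.g., as bitmaps), but the principle is the same: the join is precomputed once and reused by every star-join query.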
Cube Materialization:
(i) Base cell:- a cell in the base cuboid.
(ii) Aggregate cell:- a cell in a non-base cuboid; each aggregated
dimension is indicated by ‘*’.
Ex:- in a 3-D cube, all 1-D and 2-D cells are aggregate cells, and
the 3-D cells are base cells.
(iii) Ancestor & descendant cells:- the 1-D and 2-D cells are ancestors
of the corresponding 3-D cell; the 3-D cell is their descendant.
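The ancestor cells of a base cell are obtained by replacing subsets of its dimension values with ‘*’. A small sketch, using hypothetical base cells of a 3-D (time, location, item) cube:

```python
from itertools import combinations

base = {
    ("Q1", "Hyd",  "LCD"):  100,
    ("Q1", "Hyd",  "comp"): 150,
    ("Q2", "Bang", "LCD"):  120,
}

cells = dict(base)  # start with the base cells (no '*')
for cell, v in base.items():
    # Replace every non-empty subset of dimension positions with '*'
    # to get this base cell's ancestor (aggregate) cells, and add the
    # base cell's measure into each of them.
    for k in range(1, len(cell) + 1):
        for positions in combinations(range(len(cell)), k):
            agg = tuple("*" if i in positions else cell[i]
                        for i in range(len(cell)))
            cells[agg] = cells.get(agg, 0) + v

print(cells[("Q1", "Hyd", "*")])   # 1-D aggregated: 100 + 150 = 250
print(cells[("*", "*", "*")])      # apex cell: 100 + 150 + 120 = 370
```

The cell ("Q1", "Hyd", "*") is an ancestor of both Q1/Hyd base cells, and ("*", "*", "*") is the apex cell, the ancestor of every cell in the cube.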
Cube Materialization:
Full Cube vs. Iceberg Cube
Full cube:- computation of all the cells of all the cuboids in the data
cube.
Iceberg cube:- computation of only those cuboid cells whose aggregate
value (e.g., count) satisfies a minimum threshold, avoiding the explosion
in size of the full cube.
Multi-way Array Aggregation for Cube
Computation (MOLAP)
Partition arrays into chunks (a small subcube which fits in memory).
Compressed sparse array addressing: (chunk_id, offset)
Compute aggregates in a “multiway” fashion by visiting cube cells in the
order that minimizes the number of times each cell must be visited, thereby
reducing memory access and storage costs.
(Figure: a 3-D array with dimensions A (a0–a3), B (b0–b3), and C (c0–c3),
partitioned into 64 chunks numbered 1–64. What is the best traversing
order to do multi-way aggregation?)
Multi-Way Array Aggregation
An array-based “bottom-up” algorithm
Uses multi-dimensional chunks
No direct tuple comparisons
Simultaneous aggregation on multiple dimensions
(Figure: cuboid lattice — all; A, B, C; AB, AC, BC; ABC.)
The best order is the one that minimizes the memory requirement and
reduces I/O.
(Source: Data Mining: Concepts and Techniques)
Multi-Way Array Aggregation for Cube
Computation
Method: the planes should be sorted and computed
according to their size in ascending order
Idea: keep the smallest plane in the main memory,
fetch and compute only one chunk at a time for the
largest plane
Limitation of the method: it computes well only for a small number of
dimensions.
If there are a large number of dimensions, “top-down” computation and
iceberg cube computation methods can be explored.
Star-Cubing: An Integrating Method
D. Xin, J. Han, X. Li, B. W. Wah, Star-Cubing: Computing Iceberg Cubes
by Top-Down and Bottom-Up Integration, VLDB'03
Explore shared dimensions
E.g., dimension A is the shared dimension of ACD and AD
ABD/AB means cuboid ABD has shared dimensions AB
Allows for shared computations
e.g., cuboid AB is computed simultaneously as ABD
Aggregation proceeds in a top-down manner, but with a bottom-up
sub-layer underneath that allows Apriori pruning.
Shared dimensions grow in bottom-up fashion.
(Figure: Star-Cubing lattice — ABCD/all at the base; ACD/A, ABC/ABC,
ABD/AB, BCD above it; then AC/AC, AD/A, BC/BC, BD/B, CD; and C/C, D at
the top.)
Further Development of Data Cube
Complex Aggregation at Multiple
Granularities: Multi-Feature Cubes
Multi-feature cubes : Compute complex queries involving multiple
dependent aggregates at multiple granularities.
Many complex data mining queries can be answered by multi-feature
cubes without any significant increase in computational cost.
Ex. Grouping by all subsets of {item, region, month}, find the
maximum price in 1997 for each group, and the total sales among all
maximum price tuples
select item, region, month, max(price), sum(R.sales)
from purchases
where year = 1997
cube by item, region, month: R
such that R.price = max(price)
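The query above can be evaluated in plain Python to make its semantics concrete: for every group-by subset of {item, region, month}, find the maximum price, then sum the sales of only those tuples that attain it. The purchase tuples below are hypothetical.

```python
from itertools import combinations

# Hypothetical purchase tuples: (item, region, month, price, sales)
purchases = [
    ("TV",    "east", "Jan", 900.0, 3),
    ("TV",    "east", "Jan", 900.0, 2),   # ties with the row above at the max price
    ("TV",    "east", "Feb", 800.0, 5),
    ("radio", "west", "Jan", 100.0, 7),
]

result = {}
for k in range(4):                        # "cube by": every group-by subset
    for subset in combinations(range(3), k):
        groups = {}
        for t in purchases:
            key = tuple(t[i] if i in subset else "*" for i in range(3))
            groups.setdefault(key, []).append(t)
        for key, rows in groups.items():
            max_price = max(r[3] for r in rows)
            # "such that R.price = max(price)": sum sales only among the
            # tuples that attain the maximum price in this group.
            sales_at_max = sum(r[4] for r in rows if r[3] == max_price)
            result[key] = (max_price, sales_at_max)

print(result[("TV", "east", "*")])  # (900.0, 5) — two tuples tie at 900
print(result[("*", "*", "*")])      # (900.0, 5)
```

The point of the multi-feature cube is that the second aggregate (sum of sales) is dependent on the first (max price) within each group, yet both are computed in one pass over the groupings.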
Related constraints used in such analyses: significance constraint,
probe constraint, gradient constraint.
Both the OLAM and OLAP servers accept user queries via a GUI API and
work with the data cube via a cube API.
The OLAM server may perform multiple data mining tasks. It consists of
integrated data mining modules and is therefore more sophisticated than
an OLAP server.
An OLAM System Architecture
(Figure: OLAM system architecture —
Layer 4, user interface: mining queries in, mining results out, via a GUI API;
Layer 3, OLAP/OLAM: the OLAM engine and OLAP engine side by side;
Layer 2, MDDB: the multidimensional database with its metadata;
Layer 1, data repository: databases and the data warehouse, fed through
data cleaning, data integration, and filtering via a database API.)
Data Generalization
Data generalization is a process that abstracts a large set of
task-relevant data in a DB/DWH from relatively low-level concepts
to higher-level concepts.
Attribute-Oriented Induction: An Example
Step 2. Fetch the task-relevant set of data using an SQL statement,
i.e., transform the task specification into an SQL query.
use Big_Univ_DB
select name, gender, major, birth_place, birth_date,
residence, phone#, gpa
from student
where status in {“M.Sc.”, “MBA”, “B.Sc.”, “B.A.”,
“B.Com”, “M.Com”, “MCA”, “MBBS”}
The retrieved data is stored in a table called the initial working
relation (IWR).
Attribute-Oriented Induction: An Example
Step 3. Perform data generalization using two methods: attribute
removal and attribute generalization.
Attribute removal: remove attribute A if there is a large set of
distinct values for A but (1) there is no generalization operator
on A, or (2) A’s higher-level concepts are expressed in terms of
other attributes.
Attribute generalization: if there is a large set of distinct values
for A, and there exists a set of generalization operators on A,
then select an operator and generalize A.
Attribute-threshold control: default threshold range 2–8.
Generalized relation threshold control: controls the final
relation/rule size; the default is 10–30.
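The two methods can be sketched on a toy student relation. The rows, the city-to-country concept hierarchy, the gpa discretization, and the threshold value are all hypothetical illustrations.

```python
rows = [
    {"name": "Ann", "birth_place": "Vancouver", "gpa": 3.8},
    {"name": "Bob", "birth_place": "Toronto",   "gpa": 3.2},
    {"name": "Raj", "birth_place": "Hyderabad", "gpa": 3.6},
    {"name": "Lee", "birth_place": "Seattle",   "gpa": 3.9},
]
country = {"Vancouver": "Canada", "Toronto": "Canada",
           "Hyderabad": "India",  "Seattle": "USA"}
THRESHOLD = 3   # attribute threshold (within the default 2-8 range)

def generalize(rows):
    out = []
    for r in rows:
        g = dict(r)
        # Attribute removal: 'name' has many distinct values and no
        # generalization operator, so drop it.
        del g["name"]
        # Attribute generalization: climb the city -> country hierarchy.
        g["birth_place"] = country[g["birth_place"]]
        # Generalize the numeric attribute into ranges/labels.
        g["gpa"] = "good" if g["gpa"] >= 3.5 else "average"
        out.append(g)
    return out

gen = generalize(rows)
distinct_places = {r["birth_place"] for r in gen}
assert len(distinct_places) <= THRESHOLD   # now within the attribute threshold
print(sorted(distinct_places))             # ['Canada', 'India', 'USA']
```

After generalization, identical tuples would be merged and a count accumulated, yielding the compact generalized relation that the following slides summarize.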
Attribute-Oriented Induction: An Example
Name: removed, since it has a large set of distinct values and no
generalization operator.
Class Characterization: An Example
Gender \ Birth_Region   Canada   Foreign   Total
M                           16        14      30
F                           10        22      32
Total                       26        36      62
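Such a crosstab (with marginal totals) is easy to build from generalized tuples of the form (gender, birth_region, count); the counts below are those from the table:

```python
data = [("M", "Canada", 16), ("M", "Foreign", 14),
        ("F", "Canada", 10), ("F", "Foreign", 22)]

# Accumulate each count into its cell plus the row, column, and grand totals.
table = {}
for gender, region, n in data:
    for key in [(gender, region), (gender, "Total"),
                ("Total", region), ("Total", "Total")]:
        table[key] = table.get(key, 0) + n

print(table[("M", "Total")])       # 30
print(table[("F", "Total")])       # 32
print(table[("Total", "Total")])   # 62
```

Each measure value is simply added into four cells at once, which is the same aggregation pattern a roll-up over the gender and birth_region dimensions would perform.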