0% found this document useful (0 votes)
23 views112 pages

Intoduction DW

Uploaded by

notesbook14925
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
23 views112 pages

Intoduction DW

Uploaded by

notesbook14925
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 112

Data Warehousing

Introduction and Background

1
Reference Books
 W. H. Inmon, Building the Data Warehouse
(Second Edition), John Wiley & Sons Inc., NY.
 A. Abdullah, “Data Warehousing for
beginners: Concepts & Issues” (First
Edition).
 Paulraj Ponniah, Data Warehousing
Fundamentals, John Wiley & Sons Inc., NY.

2
Introduction
► An organization needs data from and in
various forms from heterogeneous sources.
► Data warehousing refers to the process of
combining and storing data of different
databases of an organization.
► In 1980, Bill Inmon coined the term ‘Data
Warehouse’.
Introduction
► The operational computer systems - provide
information to run the day-to-day operations
► Executives - strategic decisions.
► Where to build the next warehouse, which
product lines to expand, and which markets
to strengthen.
► Data warehousing is a new paradigm
specifically intended to provide vital strategic
information. 4
The need

“Drowning in data and starving


for information”
Knowledge is power, Intelligence
is absolute power!

5
The need
$
POWER

INTELLIGENCE

KNOWLEDGE

INFORMATION

DATA

6
Strategic Information
► Not for running the day-to-day operations of
the business.
► It is not intended to produce an invoice,
make a shipment, settle a claim, or post a
withdrawal from a bank account.
► More important for the continued health and
survival of the corporation.
► Critical business decisions depend on

7
Strategic Information
Must have a single, enterprise-wide
INTEGRATED
view.
DATA Information must be accurate and
INTEGRITY must conform to business rules.
Easily accessible with intuitive access
ACCESSIBLE
paths, and responsive for analysis.
Every business factor must have one
CREDIBLE
and only one value.
Information must be available within
TIMELY
the stipulated time frame.
8
Strategic Information
► A business unit of a leading long-distance telephone carrier
empowers its sales personnel to make better business decisions
and thereby capture more business in a highly competitive,
multibillion-dollar market. A Web-accessible solution gathers
internal and external data to provide strategic information.
► Availability of strategic information at one of the largest banks in
the United States with assets in the $250 billion range allows
users to make quick decisions to retain their valued customers.
► In the case of a large health management organization,
significant improvements in health care programs are realized,
resulting in a 22% decrease in emergency room visits, 29%
decrease in hospital admissions for asthmatic children,
potentially sight-saving screenings for hundreds of diabetics,
improved vaccination rates, and more than 100,000
performance reports created annually for physicians and
pharmacists. 9
Historical overview

1960
Master Files & Reports

1965
Lots of Master files!

1970
Direct Access Memory & DBMS

1975
Online high performance transaction processing 

10
Historical overview

1980
PCs and 4GL Technology (MIS/DSS) 
1985 & 1990 
Extract programs, extract processing,
The legacy system’s web

11
Historical overview: Crisis of Credibility
What is the financial health of our company?

??

 

-10%

+10%



12
Operational Systems
► Operational systems are online transaction processing
(OLTP) systems.
► To run the day-to-day core business of the company.
► Bread-and-butter systems.
► Operational systems make the wheels of business
turn.
► Support the basic business processes of the company.
► These systems typically get the data into the
database.
► Each transaction processes information about a single
entity such as a single order, a single invoice, or a
single customer. 13
Operational Systems

14
DSS

15
DSS
► Specially designed and built DSSs are not
meant to run the core business processes.
They are used to watch how the business
runs, and then make strategic decisions to
improve the business.
► Decision-support systems are developed to
get strategic information out of the
database, as opposed to OLTP systems that
are designed to put the data into the
database 16
Failures of Past Decision Support
Systems
Business user
needs info

Answers result
User requests
in more questions
IT people

?
Business user
may get answers
 IT people do
system analysis
and design

IT people
send reports to IT people
business user create reports

17
Inability to Provide Information
► Strategic information was been extracted from the
existing operational systems.
► These operational systems such as University Record
system, inventory management, claims processing,
outpatient billing, and so on are not designed in a way
to provide strategic information.
► If we need the strategic information, the information
must be collected from altogether different types of
systems.
► Only specially designed decision support systems or
informational systems can provide strategic
information. 18
Inability to Provide Information
Operational Systems Informational Systems
Current Values are the Data Content Data is Archived, Derived,
Summarized
Data structure is Optimized for Data structure is Optimized for
transactions complex queries

Access frequency is High Access frequency is Medium to low

Data Access type is Read, Update, Data access type is only Read
Delete
Usage is Predictable, Repetitive Usage is Ad hoc, Random, Heuristic

Response time is in Sub – seconds Response time is in Several seconds to


minutes
Large Number of users Relatively small number of users
19
Business Intelligence at the
Data Warehouse

20
Data Warehouse

21
Why a Data Warehouse (DWH)?
► Data recording and storage is growing.

► History is excellent predictor of the future.

► Gives total view of the organization.

► Intelligent
decision-support is required for
decision-making.

22
Reason-1: Why a Data Warehouse?
► Data Sets are growing.

How Much Data is that?


1 MB 220 or 106 bytes Small novel – 31/2 Disk
Paper rims that could fill the back of
1 GB 230 or 109 bytes
a pickup van
50,000 trees chopped and converted
1 TB 240 or 1012 bytes
into paper and printed
Academic research libraries across
2 PB 1 PB = 250 or 1015 bytes
the U.S.
All words ever spoken by human
5 EB 1 EB = 260 or 1018 bytes
beings

23
Reason-1: Why a Data Warehouse?
► Sizeof Data Sets are going up .
► Cost of data storage is coming down .

 The amount of data average business collects


and stores is doubling every year

 Total hardware and software cost to store and


manage 1 Mbyte of data
►1990: ~ $15
►2002: ~ ¢15 (Down 100 times)
►By 2007: < ¢1 (Down 150 times)

24
Reason-1: Why a Data Warehouse?
 A Few Examples
►WalMart: 24 TB
►France Telecom: ~ 100 TB
►CERN: Up to 20 PB by 2006
►Stanford Linear Accelerator Center (SLAC):
500TB

25
Caution!

A Warehouse of Data
is NOT a
Data Warehouse

26
Caution!

Size
is NOT
Everything

27
Reason-2: Why a Data Warehouse?

►Businesses demand Intelligence (BI).


 Complex questions from integrated data.
 “Intelligent Enterprise”

28
Reason-2: Why a Data Warehouse?
DBMS Approach
List of all items that were sold last
month?

List of all items purchased by Tariq


Majeed?

The total sales of the last month


grouped by branch?

How many sales transactions


occurred during the month of
January?
29
Reason-2: Why a Data Warehouse?
Intelligent Enterprise
Which items sell together? Which
items to stock?

Where and how to place the items?


What discounts to offer?

How best to target customers to


increase sales at a branch?

Which customers are most likely to


respond to my next promotional
campaign, and why?
30
Reason-3: Why a Data Warehouse?
►Businesses want much more…

 What happened?
 Why it happened? Stages of
Data
 What will happen? Warehouse

 What is happening?
 What do you want to happen?
31
What is a Data Warehouse?

A complete repository of historical


corporate data extracted from
transaction systems that is
available for ad-hoc access by
knowledge workers.

32
What is a Data Warehouse?
Complete repository
History
Transaction System
Ad-Hoc access
Knowledge workers

33
What is a Data Warehouse?
Transaction System
 Management Information System (MIS)
 Could be typed sheets (NOT transaction system)

Ad-Hoc access
 Dose not have a certain access pattern.
 Queries not known in advance.
 Difficult to write SQL in advance.

Knowledge workers
 Typically NOT IT literate (Executives, Analysts, Managers).
 NOT clerical workers.
 Decision makers. 34
Another View of a DWH

Subject
Oriented

Integrated

Time
Variant

Non
Volatile

35
What is a Data Warehouse ?
It is a blend of many technologies, the basic
concept being:

 Take all data from different operational systems.


 If necessary, add relevant data from industry.

 Transform all data and bring into a uniform format.

 Integrate all data as a single entity.

36
What is a Data Warehouse ? (Cont…)
It is a blend of many technologies, the basic
concept being:

Store data in a format supporting easy access for


decision support.
 Create performance enhancing indices.

 Implement performance enhancement joins.

 Run ad-hoc queries with low selectivity.

37
How is it Different?
► Different patterns of hardware utilization

100%

0%

Operational DWH

Bus Service vs. Train


38
How is it Different?
► Combines operational and historical data.
 Don’t do data entry into a DWH, OLTP or ERP are the source
systems.

 OLTP systems don’t keep history, cant get balance statement


more than a year old.
 DWH keep historical data, even of bygone customers. Why?

 In the context of bank, want to know why the customer left?

 What were the events that led to his/her leaving? Why?


 Customer retention.

39
How much history?

► Depends on:
 Industry.
 Cost of storing historical data.
 Economic value of historical data.

40
How much history?
► Industries and history
 Telecomm calls are much much more as compared to
bank transactions- 18 months.

 Retailers interested in analyzing yearly seasonal


patterns- 65 weeks.
 Insurance companies want to do actuary analysis, use
the historical data in order to predict risk- 7 years.

41
How much history?

Economic value of data


Vs.
Storage cost

Data Warehouse a
complete repository of data?
42
How is it Different?
► Usually
(but not always) periodic or batch
updates rather than real-time.

 The boundary is blurring for active data warehousing.

 For an ATM, if update not in real-time, then lot of real


trouble.
 DWH is for strategic decision making based on historical
data. Wont hurt if transactions of last one hour/day are
absent.

43
How is it Different?

 Rate of update depends on:


 volume of data,
 nature of business,
 cost of keeping historical data,
 benefit of keeping historical data.

44
How is it Different?
► Does
not follows the traditional development
model
Requirements

 Program

Classical SDLC

 Requirements gathering
 Analysis
 Design
 Programming
 Testing
 Integration
 Implementation
45
How is it Different?
► Does
not follows the traditional development
model DWH

Program

 Requirements
DWH SDLC (CLDS)

 Implement warehouse
 Integrate data
 Test for biasness
 Program w.r.t data
 Design DSS system
 Analyze results
 Understand requirement 46
Data Warehouse Vs. OLTP

OLTP (On Line Transaction Processing)


Select tx_date, balance from tx_table
Where account_ID = 23876;

47
Data Warehouse Vs. OLTP

DWH
Select balance, age, sal, gender from
customer_table, tx_table
Where age between (30 and 40) and
Education = ‘graduate’ and
CustID.customer_table =
Customer_ID.tx_table;

48
Data Warehouse Vs. OLTP

OLTP DWH
Primary key used Primary key NOT used
No concept of Primary Index Primary index used
Few rows returned Many rows returned

May use a single table Uses multiple tables


High selectivity of query Low selectivity of query
Indexing on primary key Indexing on primary index
(unique) (non-unique)

49
Data Warehouse Vs. OLTP
OLTP: OnLine Transaction Processing (MIS or Database System)

Data Warehouse OLTP


Scope * Application –Neutral * Application specific
* Single source of “truth” * Multiple databases with repetition
* Evolves over time * Off the shelf application
* How to improve business * Runs the business

Data * Historical, detailed data * Operational data


Perspective * Some summary * No summary
* Lightly denormalized * Fully normalized

Queries * Hardly uses PK * Based on PK


* Number of results * Number of results returned in
returned in thousands hundreds

Time factor * Minutes to hours * Sub seconds to seconds


* Typical availability 6x12 * Typical availability 24x7
50
Scenario 1

ABC Pvt. Ltd is a company with branches at


Mumbai, Delhi, Chennai and Bangalore.
The Sales Manager wants quarterly sales
report.
Each branch has a separate operational
system.
Scenario 1 : ABC Pvt Ltd.
Mumbai

Delhi
Sales per item type per branch Sales
for first quarter. Manager

Chennai

Banglore
Solution 1:ABC Pvt Ltd.

► Extract sales information from each database.


► Store the information in a common repository
at a single site.
Solution 1:ABC Pvt Ltd.
Mumbai

Report
Delhi
Query & Sales
Data Analysis tools Manager
Warehouse

Chennai

Banglore
Scenario 2

One Stop Shopping Super Market has huge


operational database.
Whenever Executives wants some report the
OLTP system becomes slow and data entry
operators have to wait for some time.
Scenario 2 : One Stop Shopping

Data Entry Operator


Report

Wait Operational Management


Database

Data Entry Operator


Solution 2
► Extract data needed for analysis from
operational database.
► Store it in another system, the data
warehouse.
► Refresh warehouse at regular intervals so that
it contains up to date information for analysis.
► Warehouse will contain data with historical
perspective.
Solution 2

Data Entry
Operator

Report

Transaction Extract Data


Operational Manager
data Warehouse
database

Data Entry
Operator
Scenario 3

Cakes & Cookies is a small, new company. The


chairman of this company wants his company to
grow. He needs information so that he can
make correct decisions.
Solution 3
► Improve the quality of data before loading it
into the warehouse.
► Perform data cleaning and transformation
before loading the data.
► Use query analysis tools to support adhoc
queries.
What is an operational system?
A. A system that is used to run the business
in real time and is based on historical data.
B. A system that is used to run the business in
real time and is based on current data.
C. A system that is used to support decision
making and is based on current data.
D. A system that is used to support decision
making and is based on historical data. 61
Characteristics of Data Warehouse

Subject
Oriented

Integrated

Time
Variant

Non
Volatile

62
Characteristics of Data Warehouse

► Subject oriented. Data are organized based on


how the users refer to them.
► Integrated. All inconsistencies regarding naming
convention and value representations are
removed.
► Nonvolatile. Data are stored in read-only format
and do not change over time.
► Time variant. Data are not current but normally
time series.
Characteristics of Data Warehouse
► Summarized Operational data are mapped into a
decision-usable format
► Large volume. Time series data sets are
normally quite large.
► Not normalized. DW data can be, and often are,
redundant.
► Metadata. Data about data are stored.
► Data sources. Data come from internal and
external unintegrated operational systems.
Subject Orientation
Application Environment Data warehouse
Environment
Design activities must be equally DW world is primarily void of process
focused on both process and database design and tends to focus exclusively on
design issues of data modeling and database
design
67
Data Integrated
► Integration – consistency naming
conventions and measurement attributers,
accuracy, and common aggregation.
► Establishment of a common unit of measure
for all synonymous data elements from
dissimilar database.
► The data must be stored in the DW in an
integrated, globally acceptable manner
Integrated

69
Time Variant
► In an operational application system, the
expectation is that all data within the database are
accurate as of the moment of access. In the DW
data are simply assumed to be accurate as of
some moment in time and not necessarily right
now.
► One of the places where DW data display time
variance is in the structure of the record key.
Every primary key contained within the DW must
contain, either implicitly or explicitly an element of
time( day, week, month, etc)
Nonvolatility

71
Nonvolatility
Application DW
The design issues must focus on data Such issues are no concern to in a DW
integrity and update anomalies. environment because data update is
Complex processes must be coded to never performed.
ensure that the data update processes
allow for high integrity of the final
product.
Data is placed in normalized form to Designers find it useful to store many
ensure a minimal redundancy (totals of such calculations or
that could be calculated would never summarizations.
be stored)
The technologies necessary to support Relative simplicity in technology
issues of transaction and data
recovery, roll back, and detection and
remedy of deadlock are quite
complex.
Data Granularity

73
The Data Mart

► It
is lower-cost, scaled down version of the
DW.

► Data Mart offer a targeted and less costly


method of gaining the advantages associated
with data warehousing and can be scaled up to a
full DW environment over time.
Misconception about data Mart
►A data mart is not warehouse.
► A data mart is not just a small data
warehouse
► A collection of data mart is not a data
warehouse
► A data warehouse is not DSS

75
How are They Different?
DATA WAREHOUSE DATA MART
 Corporate/Enterprise-wide  Departmental
 Union of all data marts  A single business process
 Data received from staging  Star-join (facts &
area dimensions)
 Queries on presentation  Technology optimal for
resource data access and analysis
 Structure for corporate view of  Structure to suit the
data departmental view of data
 Organized on E-R model
DATA WAREHOUSES AND
DATA MARTS
► Top-down or bottom-up approach?
► Enterprise-wide or departmental?
► Which first-data warehouse or data
mart?
► Build pilot or go with a full-fledged
implementation?
► Dependent or independent data marts?
What is a Data Warehouse
Architecture
► Primarily based on the business processes
of a business enterprise
► Conceptualization of how the data
warehouse is built
Top-Down Approach
Advantages :
• A truly corporate effort, an enterprise view of data
• Inherently architected—not a union of disparate
data marts
• Single, central storage of data about the content
• Centralized rules and control
• May see quick results if implemented with iterations
Disadvantages :
• Takes longer to build even with an iterative method
• High exposure/risk to failure
• Needs high level of cross-functional skills
• High outlay without proof of concept
• Data Fragmentation
Top-Down Approach

80
Bottom-Up Approach
Advantages :
• Faster and easier implementation of manageable
pieces
• Favorable return on investment and proof of concept
• Less risk of failure
• Inherently incremental; can schedule important data
marts first
• Allows project team to learn and grow
Disadvantages are:
• Each data mart has its own narrow view of data
• Permeates redundant data in every data mart
• Perpetuates inconsistent and irreconcilable data
• Proliferates unmanageable interfaces
Bottom-Up Approach

Data Mart

82
Hybrid Approach
► The Hybrid approach aims to harness the speed and user
orientation of the Bottom up approach to the integration of
the top-down approach.
► The Hybrid approach begins with an Entity Relationship
diagram of the data marts and a gradual extension of the
data marts to extend the enterprise model in a consistent,
linear fashion.
► These data marts are developed using the star schema or
dimensional models.
► The Extract, Transform and Load (ETL) tool is deployed to
extract data from the source into a non persistent staging
area and then into dimensional data marts that contain
both atomic and summary data.
83
Federated Approach
► This is a hub-and-spoke architecture often described as the
“architecture of architectures”.
► It recommends an integration of heterogeneous data
warehouses, data marts and packaged applications that
already exist in the enterprise.
► The goal is to integrate existing analytic structures
wherever possible and to define the “highest value”
metrics, dimensions and measures and share and reuse
them within existing analytic structures.
► This may result in the creation of a common staging area
to eliminate redundant data feeds or building of a data
warehouse that sources data from multiple data marts,
data warehouses or analytic applications. 84
A Practical Approach
1. Plan and define requirements at the overall
corporate level
2. Create a surrounding architecture for a
complete warehouse
3. Conform and standardize the data content
4. Implement the data warehouse as a series
of supermarts, one at a time
A Practical Approach
• Supermarts are carefully architected data
marts.
• This will avoid spread of disparate data across
several data marts.
• A data mart, is a logical subset of the complete
data warehouse, a sort of pie-wedge of the
whole data warehouse.
• A data warehouse, therefore, is a conformed
union of all data marts.
• Individual data marts are targeted to particular
business groups in the enterprise
87
88
Components
► Major components
 Source data component
 Data staging component
 Information delivery component
 Metadata component
 Management and control component

89
1. Source Data Components
► Source data can be grouped into 4 components
 Production data
► Comes from operational systems of enterprise
► Some segments are selected from it
► Narrow scope, e.g. order details
► Many variations in the data formats.
► Data resides on different hardware platforms.
► Different database systems and operating systems.
► Data from many vertical applications.
► Great challenge is to standardize and transform the
disparate data from the various production systems,
convert the data, and integrate the pieces into useful data
for storage in the data warehouse.
 Internal data
► Private datasheet, documents, customer profiles etc.
► E.g. Customer profiles for specific offering
► Special strategies to transform ‘it’ to DW (text document)
90
1. Source Data Components

 Archived data
► Olddata is archived
► DW have snapshots of historical data
 External data
► Executives depend upon external sources
► Some sources may provide information at regular, stipulated
intervals others may give you the data on request.
► E.g. market data of competitors, car rental require new
manufacturing. Define conversion

91
92
2. Data Staging Component
► After data is extracted, data is to be prepared
► Data extracted from sources needs to be changed,
converted and made ready in suitable format
► Three major functions to make data ready
 Extract
 Transform
 Load
► Staging area provides a place and area with a set
of functions to
 Clean
 Change
 Combine
 Convert

93
94
3. Data Storage Component

► Separate repository
► Data structured for efficient processing
► Redundancy is increased
► Updated after specific periods
► Only read-only

95
96
4. Information Delivery Component
1. The novice user - prefabricated reports and
preset queries.
2. The casual user - prepackaged information.
3. The business analyst - complex analysis
using information
4. The power user- navigate throughout the
data warehouse, pick up interesting data,
format his or her own queries, drill through
the data layers, and create custom reports
and ad hoc queries.
4. Information Delivery Component
• Ad hoc reports - novice and casual users.
• complex queries, multidimensional (MD) analysis,
and statistical analysis -business analysts and
power users.
• EIS feed - senior executives and high-level
managers.
• Data-mining - to discover trends and patterns
from the usage of your data.

Information delivery mechanisms-online queries


and reports. e-mail, intranet, Internet
99
4. Information Delivery Component

► Authentication issues
► Active monitoring services
 Performance, DBA note selected aggregates to
change storage
 User performance
 Aggregate awareness
 E.g. mining, OLAP etc

100
101
The Metadata
► The name suggests some high-level technological
concept, but it really is fairly simple. Metadata is
“data about data”.
► With the emergence of the data warehouse as a
decision support structure, the metadata are
considered as much a resource as the business
data they describe.
► Metadata are abstractions -- they are high level
data that provide concise descriptions of lower-
level data.
103
Metadata
► Information about data ware house ,location,
Structure
► Information regarding refreshment of warehouse
cleanup
► Information regarding security authentication and
usage statistics
► Information regarding characteristics of components

104
Significance Of meta data
► It act as glue that connect all part of data
warehouse.
► It provide information about the contain and
structure to the developers
► It open the door to the end users and
makes the contents recognizable in their
own terms.

105
The Metadata

For example, a line in a sales database may contain:


4056 KJ596 223.45

This is mostly meaningless until we consult the metadata that


tells us it was store number 4056, product KJ596 and sales
of $223.45

The metadata are essential ingredients in the transformation


of raw data into knowledge. They are the “keys” that allow
us to handle the raw data.
Types of Metadata
► Operational Metadata
► Extraction and transformation Metadata
► Enduser Metadata
► Semantic metadata

107
General Metadata Issues
General metadata issues associated with Data
Warehouse use:
 What tables, attributes and keys does the DW contain?
 Where did each set of data come from?
 What transformations were applied with cleansing?
 How have the metadata changed over time?
 How often do the data get reloaded?
 Are there so many data elements that you need to be
careful what you ask for?
Management and Control
Component
• It coordinates the services and activities within the data
warehouse.
• This component controls the data transformation and the data
transfer into the data warehouse storage.
• On the other hand, it moderates the information delivery to
the users.
• It works with the database management systems and enables
data to be properly stored in the repositories.
• It monitors the movement of data into the staging area and
from there into the data warehouse storage itself.
• This interacts with the metadata component to perform the
management and control functions.
• Metadata is the source of information for the management
module
110
111
Putting the pieces together

Data Data Warehouse Server OLAP Servers Clients


(Tier 0) (Tier 1) (Tier 2) (Tier 3)


Semistructured MOLAP
Sources Query/Reporting

www data
Meta
Data 
 Extract
Data 
 
Analysis







 Archived
Transform
Load Warehouse 
 data
(ETL) ROLAP Business
IT Data Mining
Users
Users
Operational
Data Bases 

Data sources Data Marts  Tools
Business Users

112

You might also like