Intoduction DW
Intoduction DW
                               1
              Reference Books
 W. H. Inmon, Building the Data Warehouse
  (Second Edition), John Wiley & Sons Inc., NY.
 A.    Abdullah,   “Data   Warehousing       for
  beginners: Concepts & Issues” (First
  Edition).
 Paulraj Ponniah, Data Warehousing
  Fundamentals, John Wiley & Sons Inc., NY.
                                                    2
             Introduction
► An  organization needs data from and in
  various forms from heterogeneous sources.
► Data warehousing refers to the process of
  combining and storing data of different
  databases of an organization.
► In 1980, Bill Inmon coined the term ‘Data
  Warehouse’.
               Introduction
► The   operational computer systems - provide
  information to run the day-to-day operations
► Executives - strategic decisions.
► Where to build the next warehouse, which
  product lines to expand, and which markets
  to strengthen.
► Data warehousing is a new paradigm
  specifically intended to provide vital strategic
  information.                                  4
            The need
                              5
The need
      $
  POWER
INTELLIGENCE
KNOWLEDGE
INFORMATION
DATA
                 6
        Strategic Information
► Not  for running the day-to-day operations of
  the business.
► It is not intended to produce an invoice,
  make a shipment, settle a claim, or post a
  withdrawal from a bank account.
► More important for the continued health and
  survival of the corporation.
► Critical business decisions depend on
                                              7
        Strategic Information
           Must have a single, enterprise-wide
INTEGRATED
           view.
DATA       Information must be accurate and
INTEGRITY must conform to business rules.
           Easily accessible with intuitive access
ACCESSIBLE
           paths, and responsive for analysis.
           Every business factor must have one
CREDIBLE
           and only one value.
           Information must be available within
TIMELY
           the stipulated time frame.
                                                 8
             Strategic Information
► A business unit of a leading long-distance telephone carrier
  empowers its sales personnel to make better business decisions
  and thereby capture more business in a highly competitive,
  multibillion-dollar market. A Web-accessible solution gathers
  internal and external data to provide strategic information.
► Availability of strategic information at one of the largest banks in
  the United States with assets in the $250 billion range allows
  users to make quick decisions to retain their valued customers.
► In the case of a large health management organization,
  significant improvements in health care programs are realized,
  resulting in a 22% decrease in emergency room visits, 29%
  decrease in hospital admissions for asthmatic children,
  potentially sight-saving screenings for hundreds of diabetics,
  improved vaccination rates, and more than 100,000
  performance reports created annually for physicians and
  pharmacists.                                                        9
     Historical overview
                   1960
           Master Files & Reports
                     1965
             Lots of Master files!
                    1970
       Direct Access Memory & DBMS
                    1975
Online high performance transaction processing   
                                                     10
   Historical overview
             1980
PCs and 4GL Technology (MIS/DSS)        
            1985 & 1990                 
Extract programs, extract processing,
       The legacy system’s web
                                            11
Historical overview: Crisis of Credibility
  What is the financial health of our company?
??
                                  
                                      
    -10%
             
                                     +10%
                          
                              
                                                 12
          Operational Systems
► Operational systems are online transaction processing
  (OLTP) systems.
► To run the day-to-day core business of the company.
► Bread-and-butter systems.
► Operational systems make the wheels of business
  turn.
► Support the basic business processes of the company.
► These systems typically get the data into the
  database.
► Each transaction processes information about a single
  entity such as a single order, a single invoice, or a
  single customer.                                    13
Operational Systems
                      14
DSS
      15
                   DSS
► Specially designed and built DSSs are not
  meant to run the core business processes.
  They are used to watch how the business
  runs, and then make strategic decisions to
  improve the business.
► Decision-support systems are developed to
  get strategic information out of the
  database, as opposed to OLTP systems that
  are designed to put the data into the
  database                                 16
Failures of Past Decision Support
             Systems
                        Business user
                          needs info
          Answers result
                                        User requests
        in more questions
                                          IT people
                             ?
       Business user
      may get answers
                                          IT people do
                                         system analysis
                                            and design
              IT people
           send reports to         IT people
            business user        create reports
                                                           17
Inability to Provide Information
► Strategic information was been extracted from the
  existing operational systems.
► These operational systems such as University Record
  system, inventory management, claims processing,
  outpatient billing, and so on are not designed in a way
  to provide strategic information.
► If we need the strategic information, the information
  must be collected from altogether different types of
  systems.
► Only specially designed decision support systems or
  informational     systems    can    provide    strategic
  information.                                           18
Inability to Provide Information
    Operational Systems                    Informational Systems
Current Values are the Data Content      Data   is      Archived,     Derived,
                                         Summarized
Data structure    is    Optimized     for Data structure is    Optimized      for
transactions                              complex queries
Data Access type is Read, Update, Data access type is only Read
Delete
Usage is Predictable, Repetitive         Usage is Ad hoc, Random, Heuristic
                               20
Data Warehouse
                 21
   Why a Data Warehouse (DWH)?
► Data    recording and storage is growing.
► Intelligent
            decision-support is required for
  decision-making.
                                                    22
Reason-1: Why a Data Warehouse?
► Data    Sets are growing.
                                                                           23
Reason-1: Why a Data Warehouse?
► Sizeof Data Sets are going up .
► Cost of data storage is coming down .
                                                    24
Reason-1: Why a Data Warehouse?
   A Few Examples
    ►WalMart:  24 TB
    ►France Telecom: ~ 100 TB
    ►CERN: Up to 20 PB by 2006
    ►Stanford Linear Accelerator Center (SLAC):
     500TB
                                                  25
     Caution!
A Warehouse of Data
     is NOT a
  Data Warehouse
                      26
 Caution!
   Size
 is NOT
Everything
             27
Reason-2: Why a Data Warehouse?
                                              28
Reason-2: Why a Data Warehouse?
             DBMS Approach
     List of all items that were sold last
     month?
    What happened?
    Why it happened?         Stages of
                                Data
    What will happen?       Warehouse
    What is happening?
    What do you want to happen?
                                      31
   What is a Data Warehouse?
                                  32
What is a Data Warehouse?
Complete repository
History
Transaction System
Ad-Hoc access
Knowledge workers
                            33
What is a Data Warehouse?
Transaction System
  Management Information System (MIS)
   Could be typed sheets (NOT transaction system)
Ad-Hoc access
 Dose not have a certain access pattern.
   Queries not known in advance.
   Difficult to write SQL in advance.
Knowledge workers
 Typically NOT IT literate (Executives, Analysts, Managers).
 NOT clerical workers.
 Decision makers.                                          34
Another View of a DWH
  Subject
  Oriented
Integrated
                          Time
                          Variant
                                     Non
                                    Volatile
                                               35
      What is a Data Warehouse ?
It is a blend of many technologies, the basic
concept being:
                                                        36
What is a Data Warehouse ? (Cont…)
It is a blend of many technologies, the basic
concept being:
                                                     37
              How is it Different?
► Different   patterns of hardware utilization
100%
0%
Operational DWH
                                                           39
How much history?
 ► Depends   on:
    Industry.
    Cost of storing historical data.
    Economic value of historical data.
                                          40
How much history?
 ► Industries   and history
    Telecomm calls are much much more as compared to
     bank transactions- 18 months.
                                                          41
How much history?
      Data Warehouse a
  complete repository of data?
                                 42
               How is it Different?
► Usually
        (but not always) periodic or batch
 updates rather than real-time.
                                                             43
            How is it Different?
                                          44
          How is it Different?
► Does
     not follows the traditional development
 model
              Requirements
 Program
              Classical SDLC
                                    
               Requirements gathering
               Analysis
               Design
               Programming
               Testing
               Integration
               Implementation
                                           45
          How is it Different?
► Does
     not follows the traditional development
 model        DWH
Program
                                Requirements
             DWH SDLC (CLDS)
              Implement warehouse
              Integrate data
              Test for biasness
              Program w.r.t data
              Design DSS system
              Analyze results
              Understand requirement           46
     Data Warehouse Vs. OLTP
                                        47
     Data Warehouse Vs. OLTP
DWH
Select balance, age, sal, gender from
customer_table, tx_table
Where age between (30 and 40) and
Education = ‘graduate’ and
CustID.customer_table =
Customer_ID.tx_table;
                                        48
        Data Warehouse Vs. OLTP
OLTP                        DWH
Primary key used            Primary key NOT used
No concept of Primary Index Primary index used
Few rows returned           Many rows returned
                                                        49
           Data Warehouse Vs. OLTP
OLTP: OnLine Transaction Processing (MIS or Database System)
 Delhi
            Sales per item type per branch    Sales
                   for first quarter.        Manager
Chennai
Banglore
            Solution 1:ABC Pvt Ltd.
                                             Report
 Delhi
                              Query &                  Sales
                  Data      Analysis tools            Manager
                Warehouse
Chennai
Banglore
              Scenario 2
Data Entry
 Operator
Report
Data Entry
 Operator
              Scenario 3
       Subject
       Oriented
Integrated
                               Time
                               Variant
                                          Non
                                         Volatile
                                                    62
 Characteristics of Data Warehouse
             69
                 Time Variant
► In   an operational application system, the
  expectation is that all data within the database are
  accurate as of the moment of access. In the DW
  data are simply assumed to be accurate as of
  some moment in time and not necessarily right
  now.
► One of the places where DW data display time
  variance is in the structure of the record key.
  Every primary key contained within the DW must
  contain, either implicitly or explicitly an element of
  time( day, week, month, etc)
Nonvolatility
                71
                        Nonvolatility
Application                             DW
The design issues must focus on data    Such issues are no concern to in a DW
integrity and update anomalies.         environment because data update is
Complex processes must be coded to      never performed.
ensure that the data update processes
allow for high integrity of the final
product.
Data is placed in normalized form to    Designers find it useful to store many
ensure a minimal redundancy (totals     of such calculations or
that could be calculated would never    summarizations.
be stored)
The technologies necessary to support Relative simplicity in technology
issues of transaction and data
recovery, roll back, and detection and
remedy of deadlock are quite
complex.
Data Granularity
                   73
               The Data Mart
► It
   is lower-cost, scaled down version of the
  DW.
                                        75
       How are They Different?
DATA WAREHOUSE                      DATA MART
   Corporate/Enterprise-wide          Departmental
   Union of all data marts            A single business process
   Data received from staging         Star-join     (facts      &
    area                                dimensions)
   Queries on presentation            Technology optimal for
    resource                            data access and analysis
   Structure for corporate view of    Structure to suit the
    data                                departmental view of data
   Organized on E-R model
    DATA WAREHOUSES AND
         DATA MARTS
► Top-down   or bottom-up approach?
► Enterprise-wide or departmental?
► Which first-data warehouse or data
  mart?
► Build pilot or go with a full-fledged
  implementation?
► Dependent or independent data marts?
      What is a Data Warehouse
             Architecture
► Primarily based on the business processes
  of a business enterprise
► Conceptualization of how the data
  warehouse is built
         Top-Down Approach
Advantages :
• A truly corporate effort, an enterprise view of data
• Inherently architected—not a union of disparate
  data marts
• Single, central storage of data about the content
• Centralized rules and control
• May see quick results if implemented with iterations
Disadvantages :
• Takes longer to build even with an iterative method
• High exposure/risk to failure
• Needs high level of cross-functional skills
• High outlay without proof of concept
• Data Fragmentation
Top-Down Approach
                    80
       Bottom-Up Approach
Advantages :
• Faster and easier implementation of manageable
  pieces
• Favorable return on investment and proof of concept
• Less risk of failure
• Inherently incremental; can schedule important data
  marts first
• Allows project team to learn and grow
Disadvantages are:
• Each data mart has its own narrow view of data
• Permeates redundant data in every data mart
• Perpetuates inconsistent and irreconcilable data
• Proliferates unmanageable interfaces
Bottom-Up Approach
Data Mart
                     82
              Hybrid Approach
►   The Hybrid approach aims to harness the speed and user
    orientation of the Bottom up approach to the integration of
    the top-down approach.
►   The Hybrid approach begins with an Entity Relationship
    diagram of the data marts and a gradual extension of the
    data marts to extend the enterprise model in a consistent,
    linear fashion.
►   These data marts are developed using the star schema or
    dimensional models.
►   The Extract, Transform and Load (ETL) tool is deployed to
    extract data from the source into a non persistent staging
    area and then into dimensional data marts that contain
    both atomic and summary data.
                                                              83
              Federated Approach
►   This is a hub-and-spoke architecture often described as the
    “architecture of architectures”.
►   It recommends an integration of heterogeneous data
    warehouses, data marts and packaged applications that
    already exist in the enterprise.
►   The goal is to integrate existing analytic structures
    wherever possible and to define the “highest value”
    metrics, dimensions and measures and share and reuse
    them within existing analytic structures.
►   This may result in the creation of a common staging area
    to eliminate redundant data feeds or building of a data
    warehouse that sources data from multiple data marts,
    data warehouses or analytic applications.                 84
         A Practical Approach
1.   Plan and define requirements at the overall
     corporate level
2.   Create a surrounding architecture for a
     complete warehouse
3.   Conform and standardize the data content
4.   Implement the data warehouse as a series
     of supermarts, one at a time
         A Practical Approach
•   Supermarts are carefully architected data
    marts.
•   This will avoid spread of disparate data across
    several data marts.
•   A data mart, is a logical subset of the complete
    data warehouse, a sort of pie-wedge of the
    whole data warehouse.
•   A data warehouse, therefore, is a conformed
    union of all data marts.
•   Individual data marts are targeted to particular
    business groups in the enterprise
87
88
               Components
►       Major components
        Source data component
        Data staging component
        Information delivery component
        Metadata component
        Management and control component
                                            89
           1. Source Data Components
► Source   data can be grouped into 4 components
   Production data
     ► Comes   from operational systems of enterprise
     ► Some segments are selected from it
     ► Narrow scope, e.g. order details
     ► Many variations in the data formats.
     ► Data resides on different hardware platforms.
     ► Different database systems and operating systems.
     ► Data from many vertical applications.
     ► Great challenge is to standardize and transform the
       disparate data from the various production systems,
       convert the data, and integrate the pieces into useful data
       for storage in the data warehouse.
   Internal data
     ► Private datasheet, documents, customer profiles etc.
     ► E.g. Customer profiles for specific offering
     ► Special strategies to transform ‘it’ to DW (text document)
                                                                     90
       1. Source Data Components
 Archived data
  ► Olddata is archived
  ► DW have snapshots of historical data
 External data
  ► Executives  depend upon external sources
  ► Some sources may provide information at regular, stipulated
    intervals others may give you the data on request.
  ► E.g. market data of competitors, car rental require new
    manufacturing. Define conversion
                                                                  91
92
            2. Data Staging Component
► After data is extracted, data is to be prepared
► Data extracted from sources needs to be changed,
  converted and made ready in suitable format
► Three major functions to make data ready
     Extract
     Transform
     Load
►   Staging area provides a place and area with a set
    of functions to
       Clean
       Change
       Combine
       Convert
                                                        93
94
        3. Data Storage Component
► Separate repository
► Data structured for efficient processing
► Redundancy is increased
► Updated after specific periods
► Only read-only
                                             95
96
     4. Information Delivery Component
1.   The novice user - prefabricated reports and
     preset queries.
2.   The casual user - prepackaged information.
3.   The business analyst - complex analysis
     using information
4.   The power user- navigate throughout the
     data warehouse, pick up interesting data,
     format his or her own queries, drill through
     the data layers, and create custom reports
     and ad hoc queries.
    4. Information Delivery Component
• Ad hoc reports - novice and casual users.
• complex queries, multidimensional (MD) analysis,
  and statistical analysis -business analysts and
  power users.
• EIS feed - senior executives and high-level
  managers.
• Data-mining - to discover trends and patterns
  from the usage of your data.
► Authentication issues
► Active monitoring services
   Performance, DBA note selected aggregates to
    change storage
   User performance
   Aggregate awareness
   E.g. mining, OLAP etc
                                              100
101
              The Metadata
► The  name suggests some high-level technological
  concept, but it really is fairly simple. Metadata is
  “data about data”.
► With the emergence of the data warehouse as a
  decision support structure, the metadata are
  considered as much a resource as the business
  data they describe.
► Metadata are abstractions -- they are high level
  data that provide concise descriptions of lower-
  level data.
103
                    Metadata
► Information   about data ware house ,location,
  Structure
► Information regarding refreshment of warehouse
  cleanup
► Information regarding security authentication and
  usage statistics
► Information regarding characteristics of components
                                                   104
       Significance Of meta data
► It act as glue that connect all part of data
  warehouse.
► It provide information about the contain and
  structure to the developers
► It open the door to the end users and
  makes the contents recognizable in their
  own terms.
                                            105
               The Metadata
                                           107
          General Metadata Issues
General metadata issues associated with Data
  Warehouse use:
     What tables, attributes and keys does the DW contain?
     Where did each set of data come from?
     What transformations were applied with cleansing?
     How have the metadata changed over time?
     How often do the data get reloaded?
     Are there so many data elements that you need to be
      careful what you ask for?
          Management and Control
               Component
•   It coordinates the services and activities within the data
    warehouse.
•   This component controls the data transformation and the data
    transfer into the data warehouse storage.
•   On the other hand, it moderates the information delivery to
    the users.
•   It works with the database management systems and enables
    data to be properly stored in the repositories.
•   It monitors the movement of data into the staging area and
    from there into the data warehouse storage itself.
•   This interacts with the metadata component to perform the
    management and control functions.
•   Metadata is the source of information for the management
    module
110
111
                      Putting the pieces together
                    
        Semistructured                                     MOLAP
           Sources                                                     Query/Reporting
                    www data
                                    Meta
                                    Data                                   
                  Extract
                                    Data                                                  
 
                                                                           Analysis
                                                                          
  
  
 
        Archived
                    Transform
                    Load          Warehouse                                              
          data
                    (ETL)                                  ROLAP                    Business
 IT                                                                      Data Mining
                                                                                     Users
Users
        Operational
        Data Bases                                                  
                                                                           
        Data sources                       Data Marts                      Tools
                                                              Business Users
112