Data Warehouse Basics & Models
   Definition:-
        "A data warehouse is a subject-oriented, integrated, time-variant, and
        nonvolatile collection of data in support of management's decision-making
        process."— W. H. Inmon
                      Data Warehouse - Introduction
       Subject oriented:-
        Organized around major subjects, such as customer, product, sales,
         item, etc.
         Focuses on the modeling and analysis of data for decision makers,
         not on daily operations or transaction processing.
         Provides a simple and concise view around a particular subject.
       Integrated:-
        Constructed by integrating multiple, heterogeneous data sources, such as
         relational databases, flat files, and on-line transaction records, put into
         a consistent format.
       Time-variant:-
        Data are stored to provide information from a historical
         perspective.
         Implicitly or explicitly, the key structure in the DWH contains an element
         of time.
       Non-volatile:-
        A DWH is always a physically separate store of data, transformed from
         the application data found in the operational environment.
         Operational updates of data do not occur in the data warehouse
         environment; data can only be loaded and accessed, not modified.
           Data Warehouse vs. Operational DBMS
       OLTP Systems:-
                 Major task – to perform online transaction and query processing.
                Day-to-day operations: purchasing, inventory, banking,
                 manufacturing, payroll, registration, accounting, etc.
       OLAP Systems:-
                Major task – to perform data analysis & decision making for
                 knowledge workers.
                Data organized in different formats.
       Distinct features (OLTP vs. OLAP):-
                User and system orientation: customer vs. market
                Data contents: current, detailed vs. historical, consolidated
                Database design: ER + application vs. star + subject
                Access patterns: update vs. read-only but complex queries
                                       OLTP vs. OLAP
   S.No.  Feature            OLTP                                      OLAP
   1      Definition         Online transaction processing             Online analytical processing
   2      Characteristic     Operational processing                    Informational processing
   3      Orientation        Transaction                               Analysis
   4      Users              DBA, DB professionals                     Knowledge workers (managers)
   5      Function           Day-to-day operations                     Historical / decision support
   6      DB design          ER-based, application-oriented            Star/snowflake schemas, subject-oriented
   7      Data               Dynamic, current, relational,             Static, historical, summarized,
                             detailed, up to date                      consolidated, multidimensional
   8      View               Detailed view                             Summarized view
   9      Unit of work       Short, simple transactions                Complex queries
   10     Records accessed   Tens / hundreds                           Millions
   11     No. of users       Thousands                                 Hundreds
   12     DB size            100 MB to GB                              100 GB to TB
   13     Priority           High performance                          High flexibility
   14     Metric             Transaction throughput                    Query throughput
   15     Focus              Data in                                   Information out
   16     Operations         Indexing, hashing                         Lots of scans for analysis
                  Why Separate Data Warehouse?
        To promote high performance in both systems.
                 DBMS — tuned for OLTP: access methods, indexing, searching, concurrency
                  control, recovery.
                 Warehouse — tuned for OLAP: complex OLAP queries, computations,
                  multidimensional views, consolidation.
                 Running OLAP queries on an operational DB degrades its performance.
                 OLTP supports concurrency control and recovery; applying these to a DWH
                  reduces its throughput.
                 DSS require historical data, which operational DBs do not maintain.
                 DSS require consolidation of data from heterogeneous sources, whereas
                  operational DBs contain only raw data.
                  The two systems support quite different functionalities, so we need a
                  separate DWH.
                 Many vendors are trying to optimize OLTP databases so that they can also
                  support OLAP in the future.
       From Tables and Spreadsheets to Data Cubes
        A data warehouse and OLAP tools are based on a multidimensional data
         model, which views data in the form of a data cube.
        A data cube allows data to be modeled and viewed in multiple
         dimensions, i.e. a data cube is defined by dimensions and facts.
                 Dimension tables, such as item (item_name, brand, type) or
                  time (day, week, month, quarter, year).
                 The fact table contains measures (such as dollars_sold) and keys to
                  each of the related dimension tables.
        The DWH is structured around the data cube: it provides a
         multidimensional view and allows pre-computation and fast access to
         consolidated data.
       From Tables and Spreadsheets to Data Cubes
    Ex:- The AllElectronics company may create a DWH to keep records of sales
           with respect to four dimensions: item, time, supplier, location.
        Each dimension may be associated with a table – a dimension table.
          Ex:- item is a dimension with table (item_name, type, ….)
        A data cube is organized around a central theme – the fact table.
          Ex:- sales (dollars_sold, units_sold)
        In a DWH, the data cube is n-dimensional; it is a combination of
         dimension and fact tables.
         Before the full multidimensional model, we start with a 2-D cube.
       From Tables and Spreadsheets to Data Cubes
        In the data warehousing literature, a data cube is also referred to as a
         cuboid.
         Given a set of dimensions, we can generate a cuboid for each of the
         possible subsets. The result forms a lattice of cuboids, which shows the
         data at different levels of summarization (or group by); for n dimensions
         with no concept hierarchies there are 2^n cuboids.
        The resultant lattice of cuboids is called an n-dimensional data cube.
        The cuboid that holds the lowest level of summarization is called the base
         cuboid. The topmost, 0-D cuboid, which holds the highest level of
         summarization, is called the apex cuboid. The lattice of cuboids
         forms the data cube.
                     Cube: A Lattice of Cuboids
   (Figure: lattice of cuboids for the 4-D cube on time, item, location, supplier:
    the 0-D apex cuboid "all" at the top; below it the 1-D, 2-D, and 3-D cuboids,
    e.g. the 3-D cuboids time,item,location / time,item,supplier /
    time,location,supplier / item,location,supplier; and the 4-D base cuboid
    at the bottom.)
         Conceptual Modeling of Data Warehouses
        Modeling data warehouses: dimensions & measures
                 Star schema:
                 The most popular and commonly used model.
                 The DWH contains a large central table, holding the
                  bulk of the data with no redundancy, called the fact table.
                 A set of smaller attendant tables, one for each dimension,
                  called dimension tables (denormalized tables).
                 The schema resembles a starburst, with the fact table at the
                  centre and the dimension tables radiating around it.
                 The fact table contains the key attributes of the dimensions
                  (see the example and the SQL sketch that follow).
                   Example of Star Schema
   Sales fact table: time_key, item_key, branch_key, location_key
                     (keys to the dimension tables) and the measures
                     units_sold, dollars_sold, avg_sales.
   Dimension tables:
       time     (time_key, day, day_of_the_week, month, quarter, year)
       item     (item_key, item_name, brand, type, supplier_type)
       branch   (branch_key, branch_name, branch_type)
       location (location_key, street, city, state_or_province, country)
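   A minimal relational sketch of this star schema (a sketch only: the _dim table
   names, data types, and key constraints are assumptions; the columns follow the
   figure above):

      -- Dimension tables (denormalized), one per dimension, each keyed by a surrogate key.
      CREATE TABLE time_dim (
        time_key        INT PRIMARY KEY,
        day             INT,
        day_of_the_week VARCHAR(10),
        month           INT,
        quarter         VARCHAR(2),
        year            INT
      );

      CREATE TABLE item_dim (
        item_key      INT PRIMARY KEY,
        item_name     VARCHAR(50),
        brand         VARCHAR(30),
        type          VARCHAR(30),
        supplier_type VARCHAR(30)   -- kept inside item: star dimensions are not normalized
      );

      CREATE TABLE branch_dim (
        branch_key  INT PRIMARY KEY,
        branch_name VARCHAR(30),
        branch_type VARCHAR(30)
      );

      CREATE TABLE location_dim (
        location_key      INT PRIMARY KEY,
        street            VARCHAR(50),
        city              VARCHAR(30),
        state_or_province VARCHAR(30),
        country           VARCHAR(30)
      );

      -- Central fact table: a foreign key to every dimension plus the measures.
      CREATE TABLE sales_fact (
        time_key     INT REFERENCES time_dim(time_key),
        item_key     INT REFERENCES item_dim(item_key),
        branch_key   INT REFERENCES branch_dim(branch_key),
        location_key INT REFERENCES location_dim(location_key),
        units_sold   INT,
        dollars_sold DECIMAL(12,2),
        avg_sales    DECIMAL(12,2)
      );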
         Conceptual Modeling of Data Warehouses
                 Snowflake schema:
                 A refinement of the star schema.
                 Some dimension tables are further normalized into a
                  set of smaller dimension tables, i.e. the data is split
                  into additional tables.
                 A single fact table with multiple normalized dimension
                  tables, forming a shape similar to a snowflake.
                 To reduce redundancy, the dimension tables are kept in
                  normalized form.
                 Example of Snowflake Schema
   Sales fact table: time_key, item_key, branch_key, location_key,
                     units_sold, dollars_sold, avg_sales (as in the star schema).
   Dimension tables (partly normalized):
       time     (time_key, day, day_of_the_week, month, quarter, year)
       item     (item_key, item_name, brand, type, supplier_key)
       supplier (supplier_key, supplier_type)
       branch   (branch_key, branch_name, branch_type)
       location (location_key, street, city_key)
       city     (city_key, city, state_or_province, country)
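   A sketch of the normalization step that turns the star schema sketched earlier into
   this snowflake (names follow the figure; data types are assumptions; these definitions
   replace the star-schema versions of item_dim and location_dim):

      -- item is split: supplier details move into their own table.
      CREATE TABLE supplier_dim (
        supplier_key  INT PRIMARY KEY,
        supplier_type VARCHAR(30)
      );

      CREATE TABLE item_dim (
        item_key     INT PRIMARY KEY,
        item_name    VARCHAR(50),
        brand        VARCHAR(30),
        type         VARCHAR(30),
        supplier_key INT REFERENCES supplier_dim(supplier_key)
      );

      -- location is split: city-level attributes move into a city table.
      CREATE TABLE city_dim (
        city_key          INT PRIMARY KEY,
        city              VARCHAR(30),
        state_or_province VARCHAR(30),
        country           VARCHAR(30)
      );

      CREATE TABLE location_dim (
        location_key INT PRIMARY KEY,
        street       VARCHAR(50),
        city_key     INT REFERENCES city_dim(city_key)
      );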
         Conceptual Modeling of Data Warehouses
        Fact constellation schema:
          Sophisticated applications may require multiple fact
         tables to share multiple dimension tables.
        Multiple fact tables sharing dimension tables can be
         viewed as a collection of stars, and the schema is therefore
         called a galaxy schema or fact constellation.
                   Example of Fact Constellation
   Sales fact table:    time_key, item_key, branch_key, location_key,
                        units_sold, dollars_sold, avg_sales.
   Shipping fact table: time_key, item_key, shipper_key, from_location,
                        to_location, dollar_cost, units_shipped.
   The two fact tables share the time, item, and location dimension tables;
   the shipper dimension (shipper_key, shipper_name, location_key, shipper_type)
   is used only by the shipping fact table.
                 Defining Star Schema in DMQL
      define cube sales [time, item, branch, location]:
               dollars_sold = sum(sales_in_dollars), avg_sales =
                 avg(sales_in_dollars), units_sold = count(*)
      define dimension time as (time_key, day, day_of_week, month, quarter, year)
      define dimension item as (item_key, item_name, brand, type, supplier_type)
      define dimension branch as (branch_key, branch_name, branch_type)
      define dimension location as (location_key, street, city, province_or_state,
         country)
      Defining Fact Constellation in DMQL
     define cube sales [time, item, branch, location]:
              dollars_sold = sum(sales_in_dollars), avg_sales =
                avg(sales_in_dollars), units_sold = count(*)
     define dimension time as (time_key, day, day_of_week, month, quarter, year)
     define dimension item as (item_key, item_name, brand, type, supplier_type)
     define dimension branch as (branch_key, branch_name, branch_type)
     define dimension location as (location_key, street, city, province_or_state,
        country)
     define cube shipping [time, item, shipper, from_location, to_location]:
               dollar_cost = sum(cost_in_dollars), units_shipped = count(*)
     define dimension time as time in cube sales
     define dimension item as item in cube sales
     define dimension shipper as (shipper_key, shipper_name, location as location
        in cube sales, shipper_type)
     define dimension from_location as location in cube sales
     define dimension to_location as location in cube sales
         Measures of Data Cube: Three Categories
       Data cube measures are categorized by the kind of aggregate function used:
  (i) Distributive: if the result derived by applying the function
     to n aggregate values is the same as that derived by
     applying the function on all the data without partitioning
                 E.g., count(), sum(), min(), max()
  (ii) Algebraic: if it can be computed by an algebraic function
      with M arguments (where M is a bounded integer), each of
      which is obtained by applying a distributive aggregate
      function
                 E.g., avg()
  (iii) Holistic: if there is no constant bound on the storage size
      needed to describe a subaggregate.
                 E.g., median(), mode(), rank()
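   A small illustration of why avg() is algebraic (a sketch assuming the sales_fact and
   location_dim tables from the star-schema sketch earlier): a country-level average can
   be assembled from the distributive sum() and count() kept per city, without
   revisiting the raw tuples.

      -- Per-city sub-aggregates: only (sum, count) pairs need to be stored per partition.
      WITH city_totals AS (
        SELECT l.country, l.city,
               SUM(f.dollars_sold) AS sum_sales,
               COUNT(*)            AS n_rows
        FROM sales_fact f
        JOIN location_dim l ON f.location_key = l.location_key
        GROUP BY l.country, l.city
      )
      -- Country-level average derived only from the per-city (sum, count) pairs.
      SELECT country,
             SUM(sum_sales) / SUM(n_rows) AS avg_dollars_sold
      FROM city_totals
      GROUP BY country;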
                 OLAP Operations in Data Cube
      OLAP operations provide a user-friendly environment for
       interactive data analysis of the data cube.
      A number of OLAP operations exist to provide flexibility to
       view the data from different perspectives.
      Ex:- Consider the AllElectronics sales data cube with
       item, time, and location dimensions. Time is aggregated to
       quarters and location is aggregated to city values. The
       measure used is dollars_sold.
                        OLAP Operations
  1. Slice:- Performs a selection on one dimension of the given data cube,
     resulting in a sub-cube.
               Ex:- slice for time = "Q1"
  2. Dice:- Performs a selection on two or more dimensions of the given data cube,
     resulting in a sub-cube.
           Ex:- dice for (location = "Hyd" or "Bang") and (time = "Q1" or "Q2")
                and (item = "LCD" or "comp")
  3. Roll-up (drill-up): Performs aggregation on the data cube, either by climbing
    up a concept hierarchy for a dimension or by dimension reduction.
              Ex:- roll-up on location (from cities to countries)
  4. Drill-down (roll-down): The reverse of roll-up; either steps down a
     concept hierarchy for a dimension or introduces additional dimensions.
              Ex:- drill-down on time (from quarters to months)
     (Relational equivalents of these operations are sketched below.)
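   Rough relational equivalents of these operations on the AllElectronics cube (a sketch
   assuming the sales_fact star schema from earlier; the city and item values are
   illustrative):

      -- 1. Slice: selection on one dimension (time = "Q1") gives a sub-cube.
      SELECT l.city, i.item_name, SUM(f.dollars_sold) AS dollars_sold
      FROM sales_fact f
      JOIN time_dim t     ON f.time_key     = t.time_key
      JOIN item_dim i     ON f.item_key     = i.item_key
      JOIN location_dim l ON f.location_key = l.location_key
      WHERE t.quarter = 'Q1'
      GROUP BY l.city, i.item_name;

      -- 2. Dice: selection on two or more dimensions.
      SELECT t.quarter, l.city, i.item_name, SUM(f.dollars_sold) AS dollars_sold
      FROM sales_fact f
      JOIN time_dim t     ON f.time_key     = t.time_key
      JOIN item_dim i     ON f.item_key     = i.item_key
      JOIN location_dim l ON f.location_key = l.location_key
      WHERE l.city    IN ('Hyderabad', 'Bangalore')
        AND t.quarter IN ('Q1', 'Q2')
        AND i.type    IN ('LCD', 'computer')
      GROUP BY t.quarter, l.city, i.item_name;

      -- 3. Roll-up on location: climb the city -> country hierarchy.
      --    (4. Drill-down is the reverse, e.g. group by t.month instead of t.quarter.)
      SELECT l.country, t.quarter, SUM(f.dollars_sold) AS dollars_sold
      FROM sales_fact f
      JOIN time_dim t     ON f.time_key     = t.time_key
      JOIN location_dim l ON f.location_key = l.location_key
      GROUP BY l.country, t.quarter;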
                         OLAP Operations
   5. Pivot (rotate):- A visualization operation that rotates the data axes in the view
   in order to provide an alternative presentation of the data.
                Ex:- pivot on item and location (these axes are rotated)
   6. Drill-through: goes through the bottom level of the cube to its back-end
      relational tables (e.g. using SQL).
      Data Warehouse Back-End Tools and Utilities
      Data extraction
         get data from multiple, heterogeneous, and external
          sources
      Data cleaning
         detect errors in the data and rectify them when possible
      Data transformation
         convert data from legacy or host format to warehouse
          format
       Load
          sort, summarize, consolidate, compute views, check integrity,
           and build indexes and partitions
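   A minimal sketch of one transform-and-load step under these headings, assuming a
   hypothetical staging table raw_sales extracted from a source system and the
   warehouse tables sketched earlier (all names and columns illustrative):

      -- Clean obvious errors, convert to the warehouse format, summarize to the
      -- warehouse grain, and load into the fact table.
      INSERT INTO sales_fact (time_key, item_key, branch_key, location_key,
                              units_sold, dollars_sold, avg_sales)
      SELECT t.time_key, i.item_key, b.branch_key, l.location_key,
             SUM(r.quantity)                AS units_sold,
             SUM(r.quantity * r.unit_price) AS dollars_sold,
             AVG(r.quantity * r.unit_price) AS avg_sales
      FROM raw_sales r
      JOIN time_dim t     ON t.day = r.sale_day AND t.month = r.sale_month
                         AND t.year = r.sale_year
      JOIN item_dim i     ON i.item_name   = r.item_name
      JOIN branch_dim b   ON b.branch_name = r.branch_name
      JOIN location_dim l ON l.city = r.city AND l.country = r.country
      WHERE r.quantity > 0 AND r.unit_price IS NOT NULL    -- cleaning: drop bad rows
      GROUP BY t.time_key, i.item_key, b.branch_key, l.location_key;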
                          Metadata Repository
      Meta data is the data defining warehouse objects. It stores:
      Description of the structure of the data warehouse
             schema, views, dimensions, hierarchies, derived data definitions,
              data mart locations and contents
      Operational meta-data
            data lineage (history of migrated data and transformation path),
             currency of data (active, archived, or purged), monitoring
             information (warehouse usage statistics, error reports, audit trails)
      The algorithms used for summarization
      The mapping from operational environment to the data warehouse
       Data related to system performance
             indexes and profiles that improve data access and retrieval
              performance; rules for the timing and scheduling of refresh and
              update cycles
      Business data
            business terms and definitions, ownership of data, charging policies
                 Three tier DWH Architecture
   (Figure: bottom tier – the warehouse database server; middle tier – an OLAP
    server (ROLAP or MOLAP); top tier – front-end query, reporting, analysis,
    and data mining tools.)
          A Recommended Approach for DWH Development
      Top-down – provides a systematic solution and minimizes integration
       problems, but it is expensive and lacks flexibility.
   Bottom-up – provides flexibility, low cost, and rapid development, but
       integration is difficult.
       In the recommended approach:
   (i) A high-level corporate data model is defined (within a short period) that
       provides a corporate-wide, consistent, and integrated view of data among the
       various subjects. This model is refined further as enterprise DWHs and
       data marts are developed.
  (ii) Data marts can be implemented in parallel with the enterprise DWH, based on
       the same corporate data model.
  (iii) Distributed data marts can be constructed to integrate the different data marts.
  (iv) A multi-tier DWH is constructed, in which the enterprise warehouse is the sole
       custodian of all warehouse data, which is then distributed to the various
       dependent data marts.
                  Data Warehouse Development:
                   A Recommended Approach
   (Figure: the high-level corporate data model feeds independent data marts,
    which are integrated into distributed data marts and, together with the
    enterprise data warehouse, into a multi-tier data warehouse.)
           Join indexing: join index records identify joinable tuples. Join indexing
       is especially useful for maintaining the relationship between a foreign key
       and its matching primary keys from the joinable relation.
           In a DWH, join indexing is useful for cross-table search because of the
       star schema model of the DWH.
 Cube Materialization:
 (i) Base cell:- A cell in the base cuboid is a base cell.
 (ii) Aggregate cell:- A cell in a non-base cuboid; each aggregated
     dimension is indicated by '*'.
 Ex:- In a 3-D cube, all 1-D and 2-D cells are aggregate cells and the 3-D cells
     are base cells; e.g. (a1, b1, *) aggregates the base cells (a1, b1, c) over
     all values c of the third dimension.
 (iii) Ancestor & descendant cells:- The 1-D and 2-D cells are ancestors of a
     3-D cell; the 3-D cell is their descendant.
         Cube Materialization:
       Full Cube vs. Iceberg Cube
    Full cube:- Computing all the cells of all the
     cuboids in the data cube.
    Iceberg cube:- Computing only those cells whose measure (e.g. count)
     satisfies a minimum threshold (the iceberg condition), since a full
     cube can be prohibitively large.
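   A sketch in SQL (assuming the sales_fact star schema from earlier; CUBE support
   varies by DBMS): the full cube materializes every cuboid, while an iceberg cube
   keeps only the cells that satisfy a minimum-support condition.

      -- Full cube: all 2^3 = 8 cuboids over (city, item_name, quarter).
      SELECT l.city, i.item_name, t.quarter,
             SUM(f.dollars_sold) AS dollars_sold,
             COUNT(*)            AS cnt
      FROM sales_fact f
      JOIN location_dim l ON f.location_key = l.location_key
      JOIN item_dim i     ON f.item_key     = i.item_key
      JOIN time_dim t     ON f.time_key     = t.time_key
      GROUP BY CUBE (l.city, i.item_name, t.quarter);

      -- Iceberg cube: keep only cells reaching a minimum support (here count >= 100).
      SELECT l.city, i.item_name, t.quarter,
             SUM(f.dollars_sold) AS dollars_sold,
             COUNT(*)            AS cnt
      FROM sales_fact f
      JOIN location_dim l ON f.location_key = l.location_key
      JOIN item_dim i     ON f.item_key     = i.item_key
      JOIN time_dim t     ON f.time_key     = t.time_key
      GROUP BY CUBE (l.city, i.item_name, t.quarter)
      HAVING COUNT(*) >= 100;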
             Multi-way Array Aggregation for Cube
                    Computation (MOLAP)
        Partition arrays into chunks (a small subcube which fits in memory).
        Compressed sparse array addressing: (chunk_id, offset)
        Compute aggregates in “multiway” by visiting cube cells in the order which
         minimizes the # of times to visit each cell, and reduces memory access and
         storage cost.
    (Figure: a 3-D array over dimensions A (a0–a3), B (b0–b3), and C (c0–c3),
     partitioned into 64 chunks numbered 1–64.
     What is the best traversing order to do multi-way aggregation?)
           Multi-Way Array Aggregation
    Array-based "bottom-up" algorithm
    Uses multi-dimensional chunks
    No direct tuple comparisons
    Simultaneous aggregation on multiple dimensions
    The best traversal order is the one that minimizes memory requirements
     (explained on the next slide)
    (Figure: the lattice of cuboids computed from the base cuboid ABC upward,
     through AB, AC, BC and A, B, C, to the apex "all".)
        Multi-Way Array Aggregation for Cube
                   Computation
    Method: the planes should be sorted and computed
     according to their size in ascending order
        Idea: keep the smallest plane in main memory, and
         fetch and compute only one chunk at a time for the
         largest plane
    Limitation of the method: it performs well only for a small
     number of dimensions
        If there are a large number of dimensions, "top-down"
         computation and iceberg cube computation methods
         can be explored.
        Star-Cubing: An Integrating Method
   D. Xin, J. Han, X. Li, B. W. Wah, Star-Cubing: Computing Iceberg Cubes
    by Top-Down and Bottom-Up Integration, VLDB'03
   Explore shared dimensions
       E.g., dimension A is the shared dimension of ACD and AD
       ABD/AB means cuboid ABD has shared dimensions AB
   Allows for shared computations
       e.g., cuboid AB is computed simultaneously as ABD
    Aggregate in a top-down manner, but with a bottom-up
     sub-layer underneath that allows Apriori pruning
    Shared dimensions grow in bottom-up fashion
    (Figure: the star-cubing cuboid tree for dimensions A, B, C, D, annotated with
     shared dimensions: ABCD/all; ABC/ABC, ABD/AB, ACD/A, BCD; AB/AB, AC/AC,
     AD/A, BC/BC, BD/B, CD; C/C, D.)
                Further Development of Data Cube
              Complex Aggregation at Multiple
             Granularities: Multi-Feature Cubes
      Multi-feature cubes : Compute complex queries involving multiple
       dependent aggregates at multiple granularities.
      Many complex data mining queries can be answered by multi feature
       cubes without any significant increase in computational cost.
      Ex. Grouping by all subsets of {item, region, month}, find the
       maximum price in 1997 for each group, and the total sales among all
       maximum price tuples
           select item, region, month, max(price), sum(R.sales)
           from purchases
           where year = 1997
           cube by item, region, month: R
           such that R.price = max(price)
       Constraints used in constrained gradient analysis of such cubes:
               significance constraint
               probe constraint
               gradient constraint
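   The query above uses the extended "cube by ... such that" syntax. A hedged
   standard-SQL approximation for a single granularity only (the full
   {item, region, month} grouping; the purchases table and its columns are taken
   from the query above) first finds each group's maximum price and then sums sales
   over the tuples at that price:

      SELECT p.item, p.region, p.month,
             m.max_price,
             SUM(p.sales) AS sales_at_max_price
      FROM purchases p
      JOIN (SELECT item, region, month, MAX(price) AS max_price
            FROM purchases
            WHERE year = 1997
            GROUP BY item, region, month) m
        ON  p.item   = m.item
        AND p.region = m.region
        AND p.month  = m.month
        AND p.price  = m.max_price
      WHERE p.year = 1997
      GROUP BY p.item, p.region, p.month, m.max_price;
      -- Covering all subsets of {item, region, month}, as the multi-feature cube does,
      -- would repeat this with GROUPING SETS / CUBE over those grouping columns.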
       Both OLAM and OLAP servers accept user queries via a GUI API and
       work with the data cube via a cube API.
       An OLAM server may perform multiple data mining tasks; it consists of
       integrated data mining modules and is more sophisticated than an
       OLAP server.
                  An OLAM System Architecture
      Layer 4 – User interface: mining queries go in and mining results come
                back through the user GUI API.
      Layer 3 – OLAP/OLAM: the OLAM engine and the OLAP engine, accessed
                through a cube API.
      Layer 2 – MDDB: the multidimensional database, together with metadata.
      Layer 1 – Data repository: databases and the data warehouse, populated
                through data cleaning, data integration, and filtering via a
                database API.
                      Data Generalization
        Data generalization is a process that abstracts a large set of
         task-relevant data in a DB/DWH from relatively low-level
         concepts to higher-level concepts.
    Attribute-Oriented Induction: An Example
    Step 2. Fetch the task-relevant set of data using an SQL statement,
     i.e. transform the data mining query into an SQL statement.
         Use Big_Univ_DB
         Select name, gender, major, birth_place, birth_date,
          residence, phone#, gpa
         from student
         where status in {"M.Sc.", "MBA", "B.Sc.", "B.A.",
          "B.Com", "M.Com", "MCA", "MBBS"}.
            The retrieved data is stored in a table called the initial working
             relation (IWR).
    Attribute-Oriented Induction: An Example
    Step 3. Perform data generalization using two methods: attribute
     removal and attribute generalization (see the sketch after this list).
    Attribute removal: remove attribute A if there is a large set of
     distinct values for A but (1) there is no generalization operator
     on A, or (2) A's higher-level concepts are expressed in terms of
     other attributes.
    Attribute generalization: if there is a large set of distinct values
     for A and there exists a set of generalization operators on A,
     then select an operator and generalize A.
    Attribute-threshold control: default threshold range 2–8.
    Generalized relation threshold control: controls the final
     relation/rule size; the default is 10–30.
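   A small sketch of attribute generalization on the initial working relation (assuming
   the student table selected in Step 2 and a hypothetical concept-hierarchy table
   city_country that maps birth_place cities to countries; the gpa cut-offs are
   illustrative): each retained attribute is replaced by a higher-level concept and
   identical generalized tuples are merged, accumulating a count.

      -- Drop attributes such as name and phone#, generalize birth_place (city -> country)
      -- and gpa (value -> level), and merge identical generalized tuples with a count.
      SELECT s.gender,
             s.major,
             h.country                            AS birth_region,
             CASE WHEN s.gpa >= 3.5 THEN 'excellent'
                  WHEN s.gpa >= 3.0 THEN 'very good'
                  ELSE 'good' END                 AS gpa_level,
             COUNT(*)                             AS cnt
      FROM student s
      JOIN city_country h ON s.birth_place = h.city   -- concept-hierarchy lookup (assumed)
      GROUP BY s.gender, s.major, h.country,
               CASE WHEN s.gpa >= 3.5 THEN 'excellent'
                    WHEN s.gpa >= 3.0 THEN 'very good'
                    ELSE 'good' END;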
    Attribute-Oriented Induction: An Example
    Name – remove the attribute, because it has a large set of distinct values
     and no generalization operator is defined on it.
                Class Characterization: An Example
                                  Birth_Region
                 Gender        Canada    Foreign    Total
                 M               16         14        30
                 F               10         22        32
                 Total           26         36        62
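   A sketch of how this crosstab could be produced, assuming the generalized relation
   from the previous step has been stored as a table generalized_student with columns
   gender, birth_region, and cnt (ROLLUP adds the Total row):

      SELECT gender,
             SUM(CASE WHEN birth_region =  'Canada' THEN cnt ELSE 0 END) AS canada,
             SUM(CASE WHEN birth_region <> 'Canada' THEN cnt ELSE 0 END) AS foreign_cnt,
             SUM(cnt)                                                    AS total
      FROM generalized_student
      GROUP BY ROLLUP (gender);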