Welcome to course
CS408: Data Warehousing
Outlines
•   About Me
•   Course information
•   Teaching team
•   Learning outcomes
•   Course contents and schedule
•   Assessments
•   Resources
•   Lecture 01
                                   2
Course Information
  Course Title:    Data Warehousing
  Course Code:     CS408
                   Database Systems, Data
  Prerequisites:
                   Structures, Operating Systems
  Level:           4
                                                   8
Why should you be here?
• Bad decisions can lead to disaster
   – Data Warehousing is at
   the base of decision
   support systems
    Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides   10
Why should you be here?
• Data Warehousing & OLAP are important
• They help to
   – Understand the information hidden
     within an organization’s data
       • See data from different angles:
         product,client,time,geographical area
       • Get adequate statistics to get
         your point of argumentation
         across
       • Get a glimpse of the future…
   Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides   11
Why should you be here?
                          12
 Course Contents and Schedule
Week                   LECTURE                                                         LAB
       Introduction to course, Introduction to Data
 1                                                  •   Introduction with lab course, Introduction with Tool
                       Warehouses
                                                    •   Database and SQL Revision
 2        DW Life Cycle and Basic Architecture
                                                    •   Lab task based on SQL queries
                                                    •   PSQL introduction and lab tasks based on PSQL
 3                Storage Architecture
                                                    •   Lab task based on PSQL
                                                    •   Will follow a Data set in in our labs
 4          DW Modeling – Conceptual Model          •   In this step we will design data warehouse architecture
                                                    •   Create staging Database in SQL Server 2012.
 5                   Logical Model                 •    Create Production Data warehouse in SQL Server 2012.
                                                   •    Data will be extracted from source data and load in staging tables
 6                  Physical Model
                                                        using SSIS
                                                   •    Different transformations will be applied on the data and data will
 7                    Project Brief                     stored in production tables using SSIS and implementation of slowly
                                                        changing dimension.
                        Indexes                    •    OLAP queries writing
 8
                                                   •    Query based tasks will be assigned to perform with in the lab.
                  Indexes (continued)              •    OLAP Query Optimizing
 9
                                                   •    Query based tasks will be assigned to perform with in the lab.
 10                  Optimisation                  •    Multidimensional cube will be design using SSAS
 11             Optimisation (continued)           •    Formulas, aggregation applied on the built cube
 12                  OLAP Queries                  •    Data visualization and reporting based on Datawarehouse
 13           OLAP Queries (continued)             •    Data visualization and reporting based on Cube
 14                 Building the DW                •    Reports and Dashboard management on Powerbi.com
             Real-time Data Warehousing            •    Introduction to Hadoop and other Big Data tools and techniques
 15
                 Project submission                •    DWH lab revision
 16    Big Data – Challenges and Opportunities     •    Semester Project discussion                                      13
Assessments
    Assessment type   Marks weight
       Lab Tasks          20%
         Mid              15%
         Final
                          35%
        Project           30%
                                     14
Recommended Literature
• Building the Data Warehouse
  – William H. Inmon
  – Wiley, ISBN 978-0-7645-9944-6
• The Data Warehouse Toolkit
  – Ralph Kimball & Margy Ross
  – Wiley, ISBN 978-1-118-53080-1
  Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides   15
          Lecture 01
Introduction to Data Warehouse
                                 16
This week
1.What is a data warehouse?
2. Applications and users
    Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides   17
What is a data warehouse?
• Basically a very large da ta ba se…
  – Not all very large databases are data warehouses,
    but all data warehouses are pretty large databases
  – Nowadays a warehouse is considered
    to start at around 800 GB and goes
    up to severalTB
  – It spans over several servers and
    needs an impressive amount of
    computing power
   Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides   18
What is a data warehouse?
• More specific, a collective data repository
  – Containing snapshots of the operational data
    (history)
  – Obtained through data cleansing (Extract-
    Transform- Load)
  – Useful for analytics
    Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides   19
What is a data warehouse?
• Compared to other solutions it…
  – Is suitable for tactical/strateg ic focus
  – Implies a sm all num ber of transactions
  – Implies large transactions spanning over a
    long period of time
   Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides   20
Some Definitions
• Experts say…
   – Ralph Kimball:“a copy of transaction
     data specifically structured for query and
     analysis”
   – Bill Inmon: “A data warehouse is a:
                – Subject oriented
                – Integrated
                – Non-volatile
                – Time variant
          collection of data in support of management’s decisions.”
  Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides   21
Inmon Definition (cont’d.)
• S ubject oriented
  – The data in the data warehouse is organized so that
    all the data elements relating to the same real-world
    event or object are linked together
         • Typical subject areas in DWs are
           Customer,Product,Order,Claim,
           Account,…
  Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides   22
Inmon Definition (cont’d.)
• S ubject oriented
  – Example:customer as subject in a DW
        • DW is organized in this case by the customer
        • It may consist of 10,100 or more physical tables, all
          related
                                                                         CUSTOMER
                                                                        Base customer
                                                                       Data 2003 - 2005
                                                                                                 Customer activity
                                             Base customer
                                                                                                   2001 - 2004
                                            Data 2000 - 2002
                                                           Customer activity        Customer activity
                                                           Detail 2002 - 2004       Detail 2005 - 2006
  Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides                   23
Inmon Definition (cont’d.)
• Integ ra ted
  – The data warehouse contains data from most or all
    of an organization's operational systems and this data is
    made consistent
  – E.g. gender,measurement,conflicting keys,
    consistency,…
                             operational                  encoding                     DW
                             appl A – M,F                                              M,F
                             appl B – 1,0
                             appl C – male,female
                             appl A – cm                                                cm
                             appl B – inches
  Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides   24
Inmon Definition (cont’d.)
• Non-vola tile
   – Data in the data warehouse is never over-written
     or deleted - once committed, the data is static, read-
     only, and retained for future reporting
   – Data is loaded,but not updated
   – When subsequent changes occur,a new snapshot record is
     written
                    Operational                                                          DW
      insert                             change
     delete                              access
     access                                                            load
                  Record-by-record                                                   Mass load/
                    manipulation                                                    access of data   access
  Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides            25
 Inmon Definition (cont’d.)
• T im e-va rying
   – The changes to the data in the data warehouse are
     tracked and recorded so that reports can be produced
     showing changes over tim e
   – Different environments have different tim e horizons
  • associated
      • While for operational systems a 60-
        to-90 day time horizon is normal,
        data warehouse has a
     • 5-to-10 year horizon
     Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides   26
General Definition
• More general,a DW is a
  – Repository of an
    org anization’s
    electronically stored
    data
  – Desig ned to
    facilitate
    reporting and
    analysis
   Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides   27
 Typical Features
• DW typically…
  – Reside on computers dedicated to this function
  – Run on DB MS such as Oracle, IBM DB2, Teradata or
    Microsoft SQL Server
  – Retain data for long periods of tim e
  – Consolidate data obtained from a variety of
    sources
  – Are built around their own carefully desig ned data
    m odel
    Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides   28
 Use case
• DW stands for big data volume,so lets take an
  example of 2 big com pa nies,a retailer,say
  Walmart and a RDBMS vendor,Sybase:
  – Walmart CEO:I want to keep track of sales in
    all my stores simultaneously
  – Sybase consultant:You need our wonderful RDBMS software.
    You can stuff data in as sales are rung up at cash registers and
    simultaneously query data right in your office
  – SoWalmart buys a $1 milion Sun E10000 multi-CPU
    server,a $500 000 Sybase license,a book
    “Database Design for Smarties”,and build
    themselves a normalized SQL data model
    Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides   29
Use case (cont’d.)
• After a few months of stuffing data into the tables…
  aWalmart executive asks…
  – I have noticed that there was a Colgate promotion recently,
    directed to people who live in small towns.How much
    toothpaste did we sell in those towns yesterday?
  – Translation to a query:
     select sum(sales.quantity_sold) from sales,products,product_categories, manufacturers,
     stores,cities where manufacturer_name =‘Colgate’
     and product_category_name =‘toothpaste’ and cities.population < 40 000
     and trunc(sales.date_time_of_sale) = trunc(sysdate-1) and sales.product_id =
     products.product_id
     and sales.store_id = stores.store_id
     and products.product_category_id = product_categories.product_category_id and
     products.manufacturer_id = manufacturers.manufacturer_id
     and stores.city_id = cities.city_id
   Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides   30
Use case (cont’d.)
– The tables contain large volumes of data and the
  query implies a 6 way join so it will take some
  time to execute
– The tables are at the sam e tim e also updated
  by new sales
– Soon after executive start their quest for marketing
  information store employees notice that there are times
  during the day when it is impossible to process a sale
               Any attempt to update the database results in freezing the
                 computer up for 20 minutes
   Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides   31
Use case (cont’d.)
• In minutes…theWalmart CEO calls Sybase
  tech support
                  • Walmart CEO: WE TYPE IN THE TOOTHPASTE
                     QUERY AND OUR SYSTEM HANGS!!!
                  • Sybase support: Of course it does! You built an on-
                     line transaction processing (OLTP) system. You
                     can’t feed it a decision support system (DSS)
                     query and expect things to work!
   – Walmart CEO:!@%$#.I thought this was the whole point of SQL
     and your RDBMS…to query and insert simultaneously!!
   – Sybase support:Uh,not exactly.If you’re reading from the database,
     nobody can write to the database.If you’re writing to the database,
     nobody can read from the database.So if you’ve got a query that takes
     20 minutes to run and don’t specify special locking instructions,
     nobody can update those tables for 20 minutes.
  Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides   32
Use case (cont’d.)
  – Walmart CEO: It sounds like a bug.
  – Sybase support: Actually it is a feature. We call it
    pessimistic locking.
  – Walmart CEO: Can you fix your system so that it doesn’t
    lock up???
  – Sybase support: No. But we made this great loader tool
    so that you can copy everything from your OLTP system
    into a separate Data Warehouse system at 100 GB/hour
• After a while…
   Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides   33
OLTP vs. DW
• OLTP (OnL ine T ransaction P rocessing):
  – Also known under the name of operational data,
    it represents day-to-day operational business activities:
         • Purchasing,sales,production distribution, …
  – Typically for data entr y and retrieval
    transaction processing
  – Reflects only the current state
    of the data
  Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides   34
OLTP vs. DW (cont’d.)
• OLAP (OnL ine A nalytical P rocessing):
  – Represents front-end analytics based on a
    DW repository
  – It provides information for activities like
        • Resource planning,capital budgeting,marketing initiatives,…
  – It is decision oriented
  Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides   35
OLTP vs. DW (cont’d.)
• Properties
 Operational DB                                                 DW
 Mostly updates                                                 Mostly reads
 Many small transactions                                        Queries long, complex
 MB-TB of data                                                  GB-PB of data
 Raw data                                                       Summarized data
 Clerical users                                                 Decision makers
 Up-to-date data                                                May be slightly outdated
  Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides   36
OLTP vs. DW (cont’d.)
• Consider a norm a lized da ta ba se for a store
  – The tables would look like this…
                                              Shipment
                                               Invoice                                   Invoice
                                                                                          Invoice_line_item
 Customer
 Customer_ID (PK)                              Invoice_number (PK)                        Invoice_number (FK)
 Name                                          Date                                       Item_seq_number
 Address                                       Customer_ID (FK)                           Product_ID (FK)
 City                                          Status_code (FK)                           Units
 Postal_code                                   Total                                      Unit_cost
                                               Sales_tax
                                               Shipping_charge
                                                                                            Sales_unit
                                                                                            Sales_unit_D
                                                                                            (PK) Name
                                                Product                                     Address
 Status                                                                                     City
                                                Product_ID (PK)                             Postal_cod
 Status_code (PK)                               Name                                        e
 Status                                         Description                                 Telephone_number
                                                Cost                                        Email
                                                Sales_unit_ID (FK)
  Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides              37
OLTP vs. DW (cont’d.)
• If we were to set up a DW for that store,we
  would start by building the following star schema
       Customer                                                                          Product
       Customer_ID (PK)                                                                  Product_ID (PK)
       Name                                                                              Name
       Address                                                                           Description
       City                                                                              Cost
       Postal_code                             Sales_facts
                                               Product_ID (FK)
                                               Customer_ID (FK)
                                               Time_key (FK)
                                               Sales_unit_ID (FK)
          Sales_unit                           Total                                      Time_period
                                               Sales_tax
          Sales_unit_ID (PK)                   Shipping_charge                            Time_key
          Name                                                                            (PK)
          Address City                                                                    Description
          Postal_code                                                                     Day
          Telephone                                                                       Fiscal_week
                                                                                          Fiscal_period
                                                                                          Fiscal_year
  Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides         38
OLTP vs. DW (cont’d.)
• B a sic insig hts from comparing OLTP and DWs
  – A DW is a separate (RDB MS ) installation that
    contains copies of data from on-line systems
        • Physically separate hardware may not be absolutely
          necessary if you have lots of extra com puting power,
          but it is recommended
  – With an optim istic locking DBMS you might even
    be able to get away for a while with keeping just one
    copy of your data
  Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides   39
OLTP vs. DW (cont’d.)
• There is an essentially different pattern of
  ha rdwa re utiliz a tion between on-line
  and analytical processing
                 Operational                                                       Data warehouse
   Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides   40
Applications of DW
• Typica l questions which can be answered
  with DW & OLAP
  – How much did sales unitA earn in January?
  – How much did sales unit B earn in February?
    What was their combined sales amount for the first
    quarter?
• Answering these questions with S QL -queries is
  difficult
  – Complex query formulation necessary
  – Process is likely to be slow due to complex joins and
    multiple scans
  Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides   41
Applications of DW (cont’d.)
• Why such questions can be answered better with
  a DW?
  – Because in a DW tables are rearranged and
    pre- ag g reg ated (known as computing cubes)
    e.g. Years, weeks, days etc.
         • The tables arrangement is subject oriented,usually some
           sta r schem a
  Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides   42
Applications of DW (cont’d.)
• A DW is the base repository for front-end
  a na lytics
  – OLAP
  – KDD (Knowledge Discovery in Databases) a
    data mining process
  – Data visualization
  – Reporting
   Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides   43
Applications of DW (cont’d.)
• OLAP is a form of information processing and
  thus needs to provide timely,accurate and
  understandable information
   – timely is however a relative term:
          • In OLTP we expect an update to go
            through in a matter of seconds
          • In OLAP the time to answer a query
            can take m inutes, hours or even longer
• There are many flavors of OLAP
   – ROLAP, DOLAP, MOLAP, WOLAP, HOLAP,…
  Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides   44
Applications of DW (cont’d.)
• KDD (Data Mining)
     – Constructs m odels of the data in question
           • Models can be viewed as high level summaries of the
             underlying data
ID      Name             ZIP        Sex      Age        Income          Children             Car         Spent
12       Peter         38106 M                35       € 55,000              2           Mini Van       € 210.00
15 Gabriel 38100 M                            32       € 56,000              0              SUV         € 30.00
…           …             …          …        …             …                …                …           …
122 Claire 38106 F                            21       € 42,000              0             Coupe        € 50.00
     Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides              45
Applications of DW (cont’d.)
– Based on this example a query returns the data that
  fulfills the constraints
   • SELECT * FROM CUSTOMER_TABLE WHERE
     TOTAL_SPENT > €100;
– Data mining might return the following set of rules
  for customers spending more than €100:
   • IF AGE > 35 AND CAR =‘MINIVAN’ THEN TOTAL SPENT
     > €100
   • IF SEX =‘M’ AND ZIP = 38106 THEN TOTAL SPENT >
     €100
  Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides   46
Applications of DW (cont’d.)
 – It answers questions like
       • Which products or customers are more profitable
       • Which outlets have sold the least this year
 – In consequence it motivates decisions like
       • Which products should have their production increased
       • Which customers should be targeted for special
         promotions
       • Which outlets should be closed
 Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides   47
Who is the user?
• Users of DW are called DS S a na lysts
  and usually are business persons
  – Their primary job is to define and discover
    information used in corporate decision-m aking
  – The way they think
       • “Give me what I say I want,and then I can tell you what I
         really want”
       • They work in explorative manner
  Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides   48
Who is the user? (cont’d.)
 – Typical explorative line of work
     • “Ah! Now that I see what the possibilities are,I can tell
       what I really want to see.But until I know what the
       possibilities are,I cannot describe exactly what I want…”
 – This usage has fundamental effect on the way a DW
   is developed
     • The classical system developm ent life cycle
       assumes that the requirements are known at the start of
       design
     • The DSS analyst starts with existing requirements,but
       fa ctoring in new requirem ents with time
  Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides   49
Summary
• Data Warehouse - Introduction
  – DW Definitions
  – OLTP vs. DW
  – Applications of DW
  Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides   50
Next lecture
• Data Warehouse Life Cycle
  Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides   51