INTRODUCTION        TO      DATA      WAREHOUSING          AND     COMPONENTS OF DATA WAREHOUSE ARCHITECTURE
MANAGEMENT                                                         ETL (Extract, Transform, Load)
                                                                     − moving of data from a data source into data warehouse
Data Warehouse/Enterprise Data Warehouse (EDW)                       − converts data into a usable format so that once it’s in
− system that aggregates data from different                              the data warehouse, it can be analyzed/queried/etc.
    (heterogeneous) sources into a single, central, consistent
    data store to support data analysis, data mining, artificial   Metadata
    intelligence (AI), and machine learning.                          − data about data
− enables an organization to run powerful analytics on huge           − describes all of the data that are stored in a system to
    volumes (petabytes) of historical data in ways that a                 make it searchable (authors, dates or locations of an
    standard database cannot.                                             article, create date of a file, the size of a file, etc)
− core of the BI system which is built for data analysis and          − allows data to be organized to make it usable, so it
    reporting.                                                            can be analyzed to create dashboards and reports
KEY CHARACTERISTICS OF DATA WAREHOUSE                              SQL Query Processing
1. Subject-Oriented                                                    − SQL is the de facto standard language for querying
    - it provides topic-wise information rather than a                    data; the language that analysts use to pull out
    business's overall processes (sales, promotion, inventory,            insights from their data stored in the data warehouse.
    etc.)
    - ex: analyzing a company’s sales data needs a data            Data Layer
    warehouse that concentrates on sales.                              − the access layer that allows users to actually get to
                                                                           the data
2.   Integrated                                                        − typically where data mart is.
     - developed by integrating data from varied sources into a        − partitions segments of your data out depending on
     consistent format.                                                    who you want to give access to, so you can get very
     - the data must be stored in the warehouse in a consistent            granular across your organization
     and universally acceptable manner in terms of naming,
     format, and coding for effective analysis.                    Governance and Security
                                                                      − related to the data layer in that you need to be able
3.   Time-Variant                                                         to provide fine-grained access and security policies
     - data stored in a data warehouse is documented with an              across all of your organization’s data
     element of time (explicitly or implicitly)
     - ex: the Primary Key, which must have an element of time     DATA WAREHOUSE SYSTEM
     like the day, week, or month.                                 • Decision Support System (DSS)
                                                                   • Executive Information System
4.   Non-Volatile                                                  • Management Information System
     - data once entered into a data warehouse must remain         • Business Intelligence Solution
     unchanged thus all data is read-only.                         • Analytic Application
     - previous data is not erased when current data is entered    • Data Warehouse
     helping in analyzing what has happened and when.
                                                                   HOW DATA WAREHOUSE WORKS?
A SHORT HISTORY OF DATA WAREHOUSE ARCHITECTURE                        − It works as a central repository where information
  built around a relational database system, either on-                    arrives from one or more data sources.
     premise or in the cloud, where data is both stored and           − data flows into a data warehouse from the
     processed.                                                             transactional system and other relational databases
  other       components      would     include     metadata         − data is processed, transformed, and ingested so that
     management and an API connectivity layer allowing the                  users can access the processed data in the Data
     warehouse to pull data from organizational sources and                 Warehouse through Business Intelligence tools, SQL
     provide access to analytics and visualization tools.                   clients, and spreadsheets
  typical data warehouse has four main components: a                 − data warehouse merges information coming from
     central database, ETL tools, metadata, and access tools.               different sources into one comprehensive database
  born in the 1980s, it addressed the need for optimized                   so an organization can analyze its customers more
     analytics on data. As companies’ business applications                 holistically while considering all the information
     began to grow and generate/store more data, they                       available.
     needed a system that could both manage the data and              − data warehousing makes data mining possible.
     analyze it.                                                      − data may be:
  at a high level, database admins could pull data from            Structured - generally stored in tables in the form of rows
     their operational systems and add a schema to it via             and columns (relational data)
     transformation before loading it into their data                               Semi-Structured - organized up to some extent
     warehouse                                                                      only and the rest is unstructured (XML/RDF)
  metadata became important when data warehouse                    Unstructured - unprocessed and unorganized data (Text
     architecture evolved and grew in popularity, more                files, Emails, Media logs )
     people within a company started using it to access data–
                                                                      − data warehousing makes data mining (looking for
     and the data warehouse made it easy to do so with
                                                                            patterns in the data that may lead to higher sales and
     structured data
                                                                            profits) possible
  reporting and dashboarding became a key use case, and
     SQL (structured query language) became the de facto
     way of interacting with that data.
DATA WAREHOUSE VS. DATABASE, DATA LAKE, AND DATA                    TYPES OF DATA WAREHOUSES
MART                                                                Cloud Data Warehouse
                                                                        − specifically built to run in the cloud, and it is offered
Data Warehouse                                                              to customers as a managed service
 – gathers raw data from multiple sources into a central                − seek to reduce on-premises data center footprint
     repository, structured using predefined schemas                    − the physical data warehouse infrastructure is
     designed for data analytics                                            managed by the cloud company
Data Lake                                                           Data Warehouse Software (on-premise/license)
 – centralized repository designed to store, process, and               – business can purchase a data warehouse license and
     secure large amounts of structured, semi-structured, and              then deploy a data warehouse on their own on-
     unstructured data (without predefined schemas)                        premises infrastructure
 – can store data in its native format and process any variety          – expensive but might be a better choice for
     of it, ignoring size limits; compared to a hierarchical data          government entities, financial institutions, or other
     warehouse, which stores data in files or folders, a data              organizations that want more control over their data
     lake uses a flat architecture and object storage to store             or need to comply with strict security or data privacy
     the data                                                              standards or regulations
 – commonly built on big data platforms such as Apache              Data Warehouse Appliance
     Hadoop                                                             – pre-integrated bundle of hardware and software—
                                                                           CPUs, storage, operating system, and data warehouse
                                                                           software—that a business can connect to its network
        o   designed to capture raw data (structured, semi-                and start using as-is
            structured, and unstructured)                               – sits somewhere between cloud and on-premises
        o   made for large amounts of data                                 implementations in terms of upfront cost, speed of
        o   used for ML and AI in its current state or for                 deployment, ease of scalability, and management
            analytics with processing                                      control
        o   can organize and put into databases or DW
                                                                    BENEFITS OF A DATA WAREHOUSE
                                                                    1. Better data quality: A data warehouse centralizes data
Data Mart                                                              from a variety of data sources, such as transactional
 – subset of a data warehouse that contains data specific to           systems, operational databases, and flat files. It then
     a particular business line or department                          cleanses it, eliminates duplicates, and standardizes it to
 – enable a department or business line to discover more-              create a single source of the truth.
     focused insights more quickly than possible when               2. Faster, business insights: Data warehouses enable data
     working with the broader data warehouse data set                  integration, allowing business users to leverage all of a
 – built from an existing data warehouse (or other data                company’s data into each business decision.
     sources) through a complex procedure that involves             3. Smarter decision-making: Data warehouse supports large-
     multiple technologies and tools to design and construct           scale BI functions such as data mining (finding unseen
     a physical database, populate it with data, and set up            patterns and relationships in data), artificial intelligence,
     intricate access and management protocols                         and machine learning.
                                                                    4. Gaining and growing competitive advantage: All of the
Database                                                               above combine to help an organization finding more
 – an organized collection of structured information, or               opportunities.
     data, typically stored electronically in a computer system
 – usually controlled by a database management system               CHALLENGES WITH DATA WAREHOUSE ARCHITECTURE
     (DBMS)                                                         • companies start housing more data and needing more
 – used to store and manage large amounts of unstructured              advanced analytics and a wide range of data, the data
     data structured and , and they can be used to support a           warehouse starts to become expensive and not so flexible
     wide range of activities, including data storage, data         • open data lakehouse allows you to run warehouse
     analysis, and data management                                     workloads on all kinds of data in an open and flexible
                                                                       architecture (instead of a tightly coupled system it is much
  ❖ Relational Database                                                more flexible and also can manage unstructured and semi-
     o designed to capture and record data (OLTP)                      structured data like photos, videos, IoT data etc)
     o live, real-time data                                         • data lakehouse can also support your data science, ML
     o data stored in tables with rows and columns                     and AI workloads in addition to your reporting and
     o data is highly detailed                                         dashboarding workloads (upgrade from data warehouse
     o flexible schema (how data is organized)                         architecture, then developing an open data lakehouse is
                                                                       the way to go)
                                                                    TYPES OF DATA WAREHOUSE
                                                                    1. ENTERPRISE DATA WAREHOUSE (EDW)
                                                                            – a centralized warehouse that provides decision
                                                                               support service across the enterprise
                                                                            – also provide the ability to classify data according
                                                                               to the subject and give access according to
                                                                               divisions
                                                                    2.   OPERATIONAL DATA STORE
                                                                            – nothing but data store required when neither
                                                                                data warehouse nor OLTP systems support
                                                                                organizations reporting needs
                                                                 WHAT IS A DATA WAREHOUSE USED FOR?
3.   DATA MART
                                                                 ✓ Airline
        – subset of the data warehouse
                                                                 ✓ Banking
        – specially designed for a particular line of
                                                                 ✓ Healthcare
           business, such as sales, finance, sales or finance.
                                                                 ✓ Public Sector
           Data can collect directly from sources
                                                                 ✓ Investment and insurance sector
                                                                 ✓ Retain chain
STAGES OF USE OF THE DATA WAREHOUSE
                                                                 ✓ Telecommunication
➢ OFFLINE OPERATIONAL DATABASE
                                                                 ✓ Hospitality industry
    – data is just copied from an operational system to
        another system (loading, processing, and reporting
                                                                 STEPS TO IMPLEMENT DATA WAREHOUSE
        of the copied data do not impact the operational
                                                                 a) ENTERPRISE STRATEGY
        system’s performance)
                                                                     Identify technical including current architecture and tools
                                                                     as well as facts, dimensions, and attributes (data mapping
➢    OFFLINE DATA WAREHOUSE
                                                                     and transformation)
     – data in the data warehouse is regularly updated from
          the operational database
                                                                 b) PHASED DELIVERY
     – data in data warehouse is mapped and transformed
                                                                    Data warehouse implementation should be phased based
          to meet the data warehouse objectives
                                                                    on subject areas. Related business entities like booking
                                                                    and billing should be first implemented and then
➢    REAL TIME DATA WAREHOUSE
                                                                    integrated with each other
      – data warehouses are updated whenever any
         transaction takes place in operational database         c)   INTERATIVE PROTOTYPING
      – ex: Airline or railway booking system                         Data warehouse should be developed and tested
                                                                      interatively
➢    INTEGRATED DATA WAREHOUSE
      – data warehouses are updated continuously when
         the operational system performs a transaction (data     BEST PRACTICES TO IMPLEMENT A DATA WAREHOUSE
         warehouse then generates transactions which are         o The data warehouse must be well integrated, well defined
         passed back to the operational system)                      and time stamped.
                                                                 o While designing data warehouse make sure you use right
FOUR COMPONENTS OF DATA WAREHOUSE                                    tool, stick to life cycle, take care about data conflicts and
Load Manager (the front components)                                  ready to learn your mistakes.
     – It performs with all the operations associated with       o Never replace operational systems and reports.
        the extraction and load of data into the warehouse.      o Don’t spend too much time on extracting, cleaning and
     – It prepares the data for entering into the DW                 loading data.
                                                                 o Ensure to involve all stakeholders including business
Warehouse Manager                                                    personnel in data warehouse implementation process.
   – performs       operations    associated    with    the      o Prepare a training plan for the end users.
       management of the data in the warehouse.
   – It performs operations like analysis of data to ensure      ADVANTAGES OF DATA WAREHOUSE
       consistency, creation of indexes and views,               ▪ Allows business users to quickly access critical data from
       generation of denormalization and aggregations,              some sources all in one place.
       transformation and merging of source data and             ▪ Provides consistent information on various cross-
       archiving and back up data                                   functional activities.
                                                                 ▪ Helps to integrate many sources of data to reduce stress
Query Manager (backend component)                                   on the production systems.
   – performs all the operations            related   to   the   ▪ Helps reduce total turnaround time for analysis and
       management of user queries.                                  reporting.
                                                                 ▪ Restructuring and integration make it easier for the user
End-User Access Tools                                               to used for reporting and analysis.
    – categorized into five different groups:                    ▪ Stores a large amount of historical data which helps users
        1. Data reporting                                           to analyze different time periods and trends to make
        2. Query Tools                                              future predictions
        3. Application Development tools
        4. Executive Information System (EIS) tools              DISADVANTAGES OF DATA WAREHOUSE
        5. OLAP                                                   Not an ideal option for unstructured data.
                                                                  Creation and implementation of DW is surely time
WHO NEEDS DATA WAREHOUSE?                                            confusing affair.
• Decision makers who rely on mass amount of data                 Can be outdated relatively quickly.
• Users who use customized, complex processes to obtain           Difficult to make changes in data types and ranges, data
  information from multiple data sources                             source schema, indexes, and queries.
• People who want simple technology to access the data            May seem easy, but actually, it is too complex for the
• People who want a systematic approach for making                   average users.
  decision                                                        Sometimes warehouse users will develop different
• Users who want a huge amount of data which is a                    business rules
  necessity for reports, grids or charts
• DW is a first step if you want to discover ‘hidden patterns’
  or data-flows and groupings
DATA WAREHOUSE TOOLS
MarkLogic
 -   useful data warehousing solution that makes data
     integration easier and faster using an array of
     enterprise. This tool helps to perform very complex
     search operations. It can query different types of data
     like documents, relationships, and metadata
Oracle
 -    the industry-leading database. If offers a wide range of
      choice of data warehouse solutions for both on-
      premises and in the cloud
Amazon Redshift
 -   a simple and cost-effective tool to analyze all types of
     data using standard SQL and existing BI tools