0% found this document useful (0 votes)
13 views33 pages

Unit 1 NMP-1

Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
13 views33 pages

Unit 1 NMP-1

Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 33

CCS341 DATA WAREHOUSING

UNIT I INTRODUCTION TO DATA WAREHOUSE

Data warehouse Introduction - Data warehouse components- operational database Vs data warehouse
– Data warehouse Architecture – Three-tier Data Warehouse Architecture – Autonomous Data
Warehouse- Autonomous Data Warehouse Vs Snowflake - Modern Data Warehouse

UNIT I INTRODUCTION TO DATA WAREHOUSE

TOPIC 1 DATA WAREHOUSE INTRODUCTION

A Data Warehouse (DW or DWH) is a centralized system used for storing, managing, and analyzing
large volumes of data collected from various sources. It is primarily designed to support decision-
making by providing historical, consolidated, and organized data.

Example: A retail company uses a data warehouse to combine, sales data from store branches, online
purchase data and inventory data from warehouses. They analyse it to track sales performance,
manage stock levels, and plan marketing strategies.

Key Features of a Data Warehouse

1. Subject-Oriented: Data is organized around key subjects like customers, products, or sales.
2. Integrated: Combines data from multiple sources into a single, consistent format.
3. Time-Variant: Stores historical data (not just current values), allowing analysis over time.
4. Non-Volatile: Once data is entered, it is not changed or deleted, ensuring consistency.

Goals of Data Warehousing:

1. Support Decision-Making

To help business leaders, analysts, and managers make informed and accurate decisions using
historical and current data.

Example: A sales manager uses the data warehouse to see which products performed best last year to
plan next season’s stock.

2. Integrate Data from Multiple Sources

To collect and unify data from various departments and systems (e.g., sales, finance, HR) into one
consistent format.

Example: Combine online and in-store sales data to get a complete customer view.
3. Provide Historical Data

To store and manage large volumes of past data, enabling trend analysis and forecasting.

Example: Analyze 5 years of customer purchase data to identify buying trends.

4. Improve Data Quality and Consistency

To clean, transform, and standardize data from different sources, ensuring reliability.

Example: Make sure “Customer Name” is in the same format across all records.

5. Enable Fast Query Performance

To optimize data storage and indexing for fast reporting and analytics.

Example: Generate complex sales reports in seconds instead of hours.

6. Support Business Intelligence (BI) and Reporting Tools

To serve as the backend for dashboards, KPIs, and visual analytics tools.

Example: Feed data to Tableau or Power BI dashboards used by executives.

7. Ensure Data Security and Access Control

To manage who can access what data and protect sensitive information.

Example: Only HR can access salary data, while sales managers can only see customer purchase info.

8. Facilitate Scalability and Flexibility

To allow expansion as more data is collected or new departments are added.

Example: Add marketing campaign data next year without redesigning the system.

Summary

Goal Description Example


1. Support Decision- Help stakeholders make data-driven Use sales reports to decide future
Making decisions product lines
2. Integrate Data Combine data from various systems Merge online and in-store sales
Sources (CRM, ERP, etc.) data
3. Provide Historical Maintain past records for analysis and Analyze 3-year customer buying
Data trends behavior
4. Improve Data Convert all date formats to a single
Clean, validate, and standardize data
Quality standard
5. Enable Fast Query Optimize for quick, complex queries Generate real-time revenue
Performance and reporting dashboards
6. Support BI and Act as a backend for visualization tools Feed visual dashboards for top
Reporting like Tableau or Power BI management
Goal Description Example
7. Ensure Security & Provide role-based access to sensitive HR can access salary data, Sales
Access data can't
8. Ensure Scalability & Adapt to growing data needs and new Add new marketing data sources
Flexibility data sources as the business grows

Need for a Data Warehouse

A data warehouse is essential for modern organizations to manage, analyze, and make decisions using
large volumes of data.

1. Centralized Data Management

Data is often scattered across departments and systems. A data warehouse brings all this data
together in one place. The benefit is single source of truth for the organization.

2. Efficient Reporting and Analysis

Raw operational databases are not optimized for complex queries. Data warehouses support
fast and complex analytical queries. The benefits are quick insights and better decision-
making.

3. Historical Data Storage

Operational databases typically store only recent data. A data warehouse keeps historical data
for years. The benefits are enabling trend analysis, forecasting, and long-term strategy.

4. Business Intelligence and Insights

Companies need analytics to compete effectively. Data warehouses are the foundation for BI
tools. It helps to identify patterns, reduce costs, and improve performance.

5. Data Quality and Consistency

Different systems may use inconsistent formats or have errors. ETL processes in data
warehousing clean and standardize data. The benefits are accurate and reliable reporting.

6. Better Data Security and Access Control

Not everyone should access all data. Data warehouses allow for user-based access controls.
The real benefit is it protects sensitive data while supporting collaboration.
7. Scalability and Flexibility

Data volumes grow over time. Data warehouses can scale with growing data needs. The real
benefit is it supports future growth and new data sources.

8. Integration with BI Tools

Visualization and dashboards are crucial for decision-makers. Warehouses integrate with
Power BI, Tableau, Looker, etc. Benefits are it makes analytics user-friendly and accessible.

History of Data Warehouse

The history of data warehousing can be traced through several decades of development in database
and business intelligence technologies. Here's a timeline and evolution overview:

1. 1960s–1970s: Early Data Processing

 Data was stored in flat files or hierarchical and network databases (e.g., IBM IMS).
 Focus was on transactional processing, not analytical needs.
 Businesses began to recognize the value of storing and retrieving historical data for analysis.

2. 1980s: Emergence of Decision Support Systems (DSS)

 Companies started developing Decision Support Systems to analyze business data.


 Data was often copied from operational systems into offline databases for reporting.
 These systems were not scalable and lacked integration.

3. 1990: Birth of the Term "Data Warehouse"

 Bill Inmon, known as the "Father of Data Warehousing", formally introduced the concept.
o He defined a data warehouse as:

"A subject-oriented, integrated, time-variant, and non-volatile collection of data to


support management’s decision-making process."

 Goal: Create a central repository of integrated data from different sources.

4. 1990s: Commercial Adoption and Architecture Models

 Ralph Kimball, another pioneer, proposed a bottom-up approach using data marts and
dimensional modeling (star schemas).
 Major vendors like IBM, Oracle, Microsoft started offering data warehousing tools.
 ETL (Extract, Transform, Load) tools emerged to handle data preparation.

5. 2000s: Growth in Business Intelligence and OLAP

 Data warehouses became central to Business Intelligence (BI) systems.


 Online Analytical Processing (OLAP) systems allowed fast querying and reporting.
 Enterprises adopted data mining and dashboards for insights.
6. 2010s: Big Data and Cloud Warehousing

 The rise of Big Data and NoSQL databases led to new data storage paradigms.
 Tools like Hadoop and Spark offered distributed data processing.
 Emergence of Cloud Data Warehouses such as:
o Amazon Redshift
o Google BigQuery
o Snowflake
 Shift toward real-time analytics and data lakes.

7. 2020s–Present: Modern Data Warehousing

 Unified platforms combine data warehouse + data lake (a.k.a. Lakehouse architecture).
 AI/ML integration for predictive analytics.
 Focus on:
o Scalability
o Data governance
o Self-service BI
o Real-time streaming analytics
 Examples: Databricks, Azure Synapse, Snowflake

SUMMARY TABLE

Era Key Features Technologies


1960s–70s Early data storage Flat files, IMS
1980s Decision support systems Custom-built DSS
1990s Data warehouse principles (Inmon, Kimball) ETL tools, RDBMS
2000s BI and OLAP expansion OLAP cubes, SAP BW
2010s Big Data & Cloud Hadoop, Redshift, BigQuery
2020s Unified platforms, AI/ML Snowflake, Synapse, Lakehouse

Benefits of Data Warehouse

1. Improved Decision Making


 Combines data from multiple sources into a unified view.
 Supports data-driven decisions with accurate, consistent, and timely information.
 Enables trend analysis, forecasting, and business performance tracking.

2. High-Quality and Consistent Data


 Data is cleaned, transformed, and standardized during ETL (Extract, Transform, Load).
 Ensures uniform naming conventions, data formats, and coding structures.
 Reduces errors caused by inconsistent data across departments.

3. Historical Intelligence
 Maintains historical data for years.
 Enables long-term trend analysis and comparative reporting over time.
 Crucial for identifying seasonal patterns, customer behavior, or market shifts.

4. Faster Query Performance


 Optimized for analytical queries (OLAP), not transactional workloads (OLTP).
 Indexes, materialized views, and pre-aggregation improve performance.
 Reduces load on operational systems.

5. Enhanced Business Intelligence


 Powers dashboards, reports, visualizations, and analytics tools.
 Integrates easily with BI platforms like Power BI, Tableau, or Looker.
 Supports role-based access for customized insights.

6. Time-Saving and Cost-Efficient


 Automates data collection, integration, and transformation.
 Saves time spent on manual data preparation.
 Reduces the cost of repetitive reporting and data management tasks.

7. Better Data Security and Governance


 Centralized system with controlled access and audit capabilities.
 Supports compliance with regulations like GDPR, HIPAA, etc.
 Data lineage and metadata management improve transparency.

8. Scalability
 Modern data warehouses (like Snowflake, BigQuery) scale horizontally to handle big data.
 Cloud-based solutions offer on-demand scaling and cost optimization.

9. Supports Advanced Analytics


 Acts as a foundation for machine learning, predictive modeling, and data mining.
 Facilitates integration with tools like Python, R, or SAS.

10. Competitive Advantage


 Enables businesses to react faster to market changes.
 Provides a 360-degree view of customers and operations.
 Helps in personalized marketing, inventory optimization, and risk management.
Summary Table: Benefits of Data Warehouse
Benefit Description

Improved Decision Making Provides accurate, integrated data to support strategic decisions.

High-Quality & Consistent


Cleans and standardizes data from multiple sources.
Data

Historical Intelligence Stores long-term historical data for trend and pattern analysis.

Faster Query Performance Optimized for analytical queries, reducing processing time.

Enhanced Business Intelligence Enables dashboards, reports, and visualizations for insights.

Time & Cost Efficiency Automates data processes, reducing manual work and costs.

Data Security & Governance Offers controlled access, auditing, and compliance support.

Scales with growing data volumes, especially in cloud


Scalability
environments.

Supports Advanced Analytics Facilitates AI, ML, and data mining applications.

Competitive Advantage Helps businesses stay ahead through better data-driven actions.

DATA WAREHOUSE COMPONENTS

Component Description
Operational systems (e.g., ERP, CRM, IoT sensors), external data, flat
1. Data Sources
files, APIs that generate or store raw data.
Processes that extract data from sources, clean/transform it into a
2. ETL (Extract,
suitable format, and load it into the warehouse. Tools: Informatica,
Transform, Load)
Talend, Apache NiFi, etc.
Temporary storage space where data is held and cleaned before being
3. Data Staging Area
moved to the warehouse. Ensures data quality and consistency.
4. Data Storage Centralized database that stores integrated, historical data for analysis.
(Warehouse Repository) Examples: Amazon Redshift, Google BigQuery, Snowflake.
"Data about data" – stores information about source, structure, lineage,
5. Metadata
and transformation rules. Helps users understand and manage data.
Subsets of the data warehouse focused on specific business lines or
6. Data Marts
departments (e.g., sales, marketing).
7. OLAP Engine (Online Allows multidimensional analysis (slicing, dicing, drill-down, roll-up).
Analytical Processing) Can be MOLAP, ROLAP, or HOLAP.
8. BI Tools / Front-End Interfaces for end-users to access data: dashboards, reports, queries.
Tools Tools: Power BI, Tableau, QlikView.
9. Data Governance & Controls for data access, integrity, quality, and compliance (e.g., role-
Security based access, auditing, encryption).
Component Description
10. Monitoring & Tools for performance tuning, job scheduling, and system health
Management Tools monitoring.

DATA WAREHOUSE COMPONENTS

A.Source Data Component

 This refers to the operational systems and external sources where the raw data originates.
 It includes legacy systems, relational databases, flat files, ERP systems, and external market
data.

There are three major types of source data components

1. Internal Operational Systems

These are the primary source of data for most data warehouses and include:

 Legacy systems (mainframe-based systems with COBOL files, VSAM, etc.)


 Client/server applications
 ERP systems (like SAP, Oracle E-Business Suite)
 Online transaction processing (OLTP) systems

These systems support day-to-day business operations such as order processing, payroll, billing, etc.

2. Archived Data

This includes:

 Historical operational data that has been archived for long-term storage
 Often stored in tape, microfiche, or other offline media
 Essential for supporting the time-variant nature of data warehousing

Archived data helps in understanding long-term trends and patterns.

3. External Data

Data acquired from sources outside the organization, such as:

 Market research data


 Stock market feeds
 Economic indicators
 Third-party vendor data (e.g., Dun & Bradstreet, Nielsen)

External data enriches internal data for better analytics and competitive insights.These source data
types are often heterogeneous in terms of format, structure, and platform, requiring robust ETL
(Extract, Transform, Load) processes for integration and standardization before entering the data
warehouse

B.Data Staging Component

 Responsible for data extraction, transformation, and loading (ETL).


 Involves data cleansing, data integration, and preparation for storage.
 Acts as a temporary area for data before it is loaded into the data warehouse.

ETL stands for Extract, Transform, Load, and it is a critical process in data warehousing. It refers to
the set of activities used to move data from source systems into the data warehouse.

1. Extract

 This is the first step where data is collected from various source systems.
 Sources can include:
o Operational databases
o Legacy systems
o ERP systems
o Flat files
o External data providers
 The goal is to read and copy the data without altering the original source.

Example: Extracting customer data from an Oracle ERP system.

2. Transform

 This step cleans, converts, and integrates the extracted data.


 Key activities include:
o Data cleansing (removing duplicates, fixing errors)
o Data conversion (e.g., changing date formats, currencies)
o Business rule application (e.g., calculating profit margins)
o Integration from multiple sources into a unified structure
Example: Converting all currency values to USD and standardizing product names.

3. Load

 The final step involves loading the transformed data into the data warehouse.
 Loading can be:
o Initial Load (first-time full data)
o Incremental Load (only new or updated data)
 Data is loaded into dimensional models like Star or Snowflake schemas.

Example: Loading monthly sales data into the “Fact_Sales” table in the warehouse.

Why ETL is Important

 Ensures data quality, consistency, and usability


 Enables accurate reporting and analytics
 Supports business decision-making by transforming raw data into valuable insights

C.Data Storage Component

 This is the central repository of the data warehouse.


 It stores integrated, time-variant, subject-oriented, and nonvolatile data.
 Often structured using dimensional models like star and snowflake schemas.

This component is essential for supporting business intelligence, decision-making, and analytical
processing.

D.Information Delivery Component


The Information Delivery Component is the front-end interface of the data warehouse. It allows
users—like executives, analysts, and business managers—to access, analyze, and retrieve data to
support decision-making.

 Provides tools for querying, reporting, analysis, and data mining.


 Supports users such as executives, analysts, and managers in decision-making.
 Includes dashboards, OLAP tools, scorecards, etc.

Key Functionalities of Information Delivery Component

Function Description

Query Management Users can submit custom or predefined queries to retrieve data.
Function Description

Routine reports (e.g., daily sales, monthly performance) are


Standard Reporting
delivered.

Ad-hoc Reporting Users create on-demand reports tailored to specific needs.

OLAP (Online Analytical


Multidimensional analysis (slice, dice, drill-down, roll-up).
Processing)

Dashboards Interactive panels that display KPIs and metrics.

Scorecards Track business performance against strategic goals.

Notify users of specific thresholds/events (e.g., low stock


Alerts & Notifications
alerts).

Types of Tools used :


 Reporting Tools: SAP Crystal Reports, JasperReports, SSRS
 OLAP Tools: Microsoft Excel PivotTables, IBM Cognos, SAP BW
 Data Visualization: Tableau, Power BI, Qlik
 Web Portals & Browsers: For remote and secure access
 Mobile BI Apps: Access dashboards and KPIs on mobile devices

User Categories

 Executives: Use dashboards and scorecards for strategic decisions


 Managers: Use OLAP and reports for operational decisions
 Analysts: Use query tools for deep data exploration

E.Metadata Component

The Metadata Component is the "data about data" in a data warehouse. It plays a crucial role by
describing the structure, operations, source, transformation, and meaning of the data stored in the
warehouse. Without metadata, users and even systems cannot properly understand or trust the
warehouse content. The Metadata Component is not just helpful—it is essential. It provides
transparency, ensures trust, and makes the data warehouse truly usable for both technical users and
business decision-makers.

F.Management and Control Component

The Management and Control Component acts as the "brain" or command center of the data
warehouse. It oversees and coordinates the entire functioning of the warehouse—ensuring data is
extracted, processed, stored, secured, and made available for users effectively and reliably. This
component is responsible for the automation, scheduling, monitoring, and administration of all data
warehousing processes.

Key Components Within Management Module

 Workflow engine: Controls the sequence and dependencies of tasks


 Monitoring dashboard: Real-time view of data load status, performance metrics
 Alerts and notifications: Automated warnings for job failures or performance drops
 Audit logs: Maintain logs for changes, access events, and system activities

Importance of Management and Control

 Ensures data integrity and reliability


 Minimizes downtime and processing delays
 Supports scalability and performance tuning
 Guarantees compliance with data governance and security policies

G. Metadata in the Data Warehouse

Metadata is often referred to as the "data about data" and in the context of a data warehouse, it plays a
critical role in understanding, managing, and using the warehouse effectively. Metadata is the
unifying layer that connects all components of a data warehouse.

Metadata describes the origin, structure, transformation, usage, and meaning of the data stored in the
warehouse.

It tells you:

 What data exists in the warehouse


 Where it came from
 How it was transformed
 What it means in business terms
 How and when it was last updated

Types of Metadata

Type Description
Technical Metadata Describes the schema, data types, table definitions, ETL processes, indexes
Business Metadata Explains business definitions, KPIs, metrics, rules, and user-friendly
Type Description
descriptions
Operational
Tracks job logs, data refresh times, load status, and error reports
Metadata

Data warehouse building blocks or components

OPERATIONAL DATABASE VS DATA WAREHOUSE

Feature Operational Database (OLTP) Data Warehouse (OLAP)


Supports day-to-day business Supports strategic decision making and
Purpose
operations analysis
Transaction processing (e.g., order Data analysis, reporting, and decision
Focus
entry, payroll, invoicing) support
Data Type Current, real-time data Historical, time-variant data
Consolidated from multiple sources,
Data Source Directly from business transactions
including operational databases
Highly normalized for fast Denormalized, optimized for queries and
Data Structure
inserts/updates reads
Processing Type Online Transaction Processing Online Analytical Processing (OLAP)
Feature Operational Database (OLTP) Data Warehouse (OLAP)
(OLTP)
Users Clerks, front-line employees Managers, analysts, executives
Query Complexity Simple, predefined Complex, ad-hoc, exploratory
Periodic batch updates (e.g., daily,
System Updates Frequent updates and changes
weekly)
Data Volume Smaller, limited to recent data Large, includes historical data
Performance For speed and efficiency of data For fast retrieval and complex analytical
Optimization entry queries
High – many simultaneous users Moderate – fewer users analyzing large
Concurrency
performing transactions data sets

Some key differences between operational database systems and data warehouses include:

1. Purpose: Operational database systems are used to support day-to-day operations of an


organization, while data warehouses are used to support decision-making and analysis
activities.
2. Data Structure: Operational database systems typically have a normalized data structure,
which means that the data is organized into many related tables to reduce data redundancy and
improve data consistency. Data warehouses, on the other hand, typically have a denormalized
data structure, which means that the data is organized into fewer tables optimized for
reporting and analysis.
3. Data Volume: Operational database systems typically store a smaller volume of data
compared to data warehouses, which may store years of historical data.
4. Performance: Operational database systems are optimized for transaction processing and are
designed to support high-volume, high-speed transaction processing. Data warehouses, on the
other hand, are optimized for querying and reporting and are designed to support complex
analytical queries that may involve large volumes of data.

Comparison Table
Feature Operational DB (OLTP) Data Warehouse (OLAP)
Main Goal Transaction processing Data analysis/reporting
Data Type Real-time, current data Historical, aggregated data
Operations Read/write (frequent) Read (heavy, complex queries)
Schema Design Highly normalized Denormalized (star/snowflake)
Query Speed Optimized for fast inserts Optimized for fast reads
Users Operational staff Analysts, managers
Data Update Frequency Constantly updated Periodically updated (ETL)
DATA WAREHOUSE VS DATA MART

Feature Data Warehouse Data Mart


A centralized repository of integrated A subset of a data warehouse focused on a
Definition
data from across the enterprise specific business line or department
Scope Enterprise-wide Department-level (e.g., sales, finance, HR)
Comprehensive and detailed across all
Data Coverage Subject-specific, limited data
functions
Multiple operational systems and data Often sourced from the data warehouse or
Data Source
marts directly from OLTP
Size Large (in terabytes or petabytes) Smaller (in gigabytes or terabytes)
More complex architecture and
Complexity Simpler and quicker to build
processing
Time to
Longer (months to years) Shorter (weeks to months)
Implement
Can be maintained by individual
Maintenance Requires centralized IT and governance
departments
Strategic decision-making across the Tactical or analytical use within a specific
Purpose
organization unit
Design
Top-down (Inmon approach) Bottom-up (Kimball approach)
Approach

A Data Warehouse provides a single, consistent view of enterprise data for strategic decision making.
A Data Mart is focused on the needs of a particular group and can exist independently or be b on a
data warehouse.

 Dependent Data Mart: Extracts data from a centralized data warehouse


 Independent Data Mart: Built directly from operational or external sources without a data
warehouse
DATA WAREHOUSE ARCHITECTURE

Properties of Data Warehouse Architecture

1. Separation: Analytical and transactional processing should be keep apart as much as


possible.
2. Scalability: Hardware and software architectures should be simple to upgrade the data
volume, which has to be managed and processed, and the number of user's requirements,
which have to be met, progressively increase.
3. Extensibility: The architecture should be able to perform new operations and technologies
without redesigning the whole system.
4. Security: Monitoring accesses are necessary because of the strategic data stored in the data
warehouses.
5. Administerability: Data Warehouse management should not be complicated.

A Data Warehouse is systems that combines data from multiple sources, organizes it under a single
architecture, and helps organizations make better decisions. It simplifies data handling, storage, and
reporting, making analysis more efficient. Data Warehouse Architecture uses a structured framework
to manage and store data effectively.

There are two common approaches to constructing a data warehouse:

 Top-Down Approach: This method starts with designing the overall data warehouse
architecture first and then creating individual data marts.

 Bottom-Up Approach: In this method, data marts are built first to meet specific business
needs, and later integrated into a central data warehouse.
Before diving deep into these approaches, we will first discuss the components of data warehouse
architecture.

Components of Data Warehouse Architecture

A data warehouse architecture consists of several key components that work together to store,
manage, and analyze data.

 External Sources: External sources are where data originates. These sources provide a variety
of data types, such as structured data (databases, spreadsheets); semi-structured data (XML,
JSON) and unstructured data (emails, images).

 Staging Area: The staging area is a temporary space where raw data from external sources is
validated and prepared before entering the data warehouse. This process ensures that the data
is consistent and usable. To handle this preparation effectively, ETL (Extract, Transform,
Load) tools are used.

o Extract (E): Pulls raw data from external sources.

o Transform (T): Converts raw data into a standard, uniform format.

o Load (L): Loads the transformed data into the data warehouse for further processing.

 Data Warehouse: The data warehouse acts as the central repository for storing cleansed and
organized data. It contains metadata and raw data. The data warehouse serves as the
foundation for advanced analysis, reporting, and decision-making.

 Data Marts: A data mart is a subset of a data warehouse that stores data for a specific team or
purpose, like sales or marketing. It helps users quickly access the information they need for
their work.

 Data Mining: Data mining is the process of analyzing large datasets stored in the data
warehouse to uncover meaningful patterns, trends, and insights. The insights gained can
support decision-making, identify hidden opportunities, and improve operational efficiency.

Top-Down Approach

The Top-Down Approach, introduced by Bill Inmon, is a method for designing data warehouses that
starts by building a centralized, company-wide data warehouse. This central repository acts as the
single source of truth for managing and analyzing data across the organization. It ensures data
consistency and provides a strong foundation for decision-making.

Working of Top-Down Approach

 Central Data Warehouse: The process begins with creating a comprehensive data warehouse
where data from various sources is collected, integrated, and stored. This involves the ETL
(Extract, Transform, Load) process to clean and transform the data.

 Specialized Data Marts: Once the central warehouse is established, smaller, department-
specific data marts (e.g., for finance or marketing) are built. These data marts pull information
from the main data warehouse, ensuring consistency across departments.

Advantages of Top-Down Approach

1. Consistent Dimensional View: Data marts are created directly from the central data warehouse,
ensuring a consistent dimensional view across all departments. This minimizes discrepancies and
aligns data reporting with a unified structure.

2. Improved Data Consistency: By sourcing all data marts from a single data warehouse, the
approach promotes standardization. This reduces the risk of errors and inconsistencies in reporting,
leading to more reliable business insights.
3. Easier Maintenance: Centralizing data management simplifies maintenance. Updates or changes
made in the data warehouse automatically propagate to all connected data marts, reducing the effort
and time required for upkeep.

4. Better Scalability: The approach is highly scalable, allowing organizations to add new data marts
seamlessly as their needs grow or evolve. This is particularly beneficial for businesses experiencing
rapid expansion or shifting demands.

5. Enhanced Governance: Centralized control of data ensures better governance. Organizations can
manage data access, security, and quality from a single point, ensuring compliance with standards
and regulations.

6. Reduced Data Duplication: Storing data only once in the central warehouse minimizes
duplication, saving storage space and reducing inconsistencies caused by redundant data.

7. Improved Reporting: A consistent view of data across all data marts enables more accurate and
timely reporting. This enhances decision-making and helps drive better business outcomes.

8. Better Data Integration: With all data marts being sourced from a single warehouse, integrating
data from multiple sources becomes easier. This provides a more comprehensive view of
organizational data and improves overall analytics capabilities.

Disadvantages of Top-Down Approach

1. High Cost and Time-Consuming: The Top-Down Approach requires significant investment in
terms of cost, time, and resources. Designing, implementing, and maintaining a central data
warehouse and its associated data marts can be a lengthy and expensive process, making it
challenging for smaller organizations.

2. Complexity: Implementing and managing the Top-Down Approach can be complex, especially for
large organizations with diverse and intricate data needs. The design and integration of a centralized
system demand a high level of expertise and careful planning.

3. Lack of Flexibility: Since the data warehouse and data marts are designed in advance, adapting to
new or changing business requirements can be difficult. This lack of flexibility may not suit
organizations that require dynamic and agile data reporting capabilities.
4. Limited User Involvement: The Top-Down Approach is often led by IT departments, which can
result in limited involvement from business users. This may lead to data marts that fail to address the
specific needs of end-users, reducing their overall effectiveness.

5. Data Latency: When data is sourced from multiple systems, the Top-Down Approach may
introduce delays in data processing and availability. This latency can affect the timeliness and
accuracy of reporting and analysis.

6. Data Ownership Challenges: Centralizing data in the data warehouse can create ambiguity
around data ownership and responsibilities. It may be unclear who is accountable for maintaining and
updating the data, leading to potential governance issues.

7. Integration Challenges: Integrating data from diverse sources with different formats or structures
can be difficult in the Top-Down Approach. These challenges may result in inconsistencies and
inaccuracies in the data warehouse.

8. Not Ideal for Smaller Organizations: Due to its high cost and resource requirements, the Top-
Down Approach is less suitable for smaller organizations or those with limited budgets and simpler
data needs.

Bottom-Up Approach

The Bottom-Up Approach, popularized by Ralph Kimball, takes a more flexible and incremental path
to designing data warehouses. Instead of starting with a central data warehouse, it begins by building
small, department-specific data marts that cater to the immediate needs of individual teams, such as
sales or finance. These data marts are later integrated to form a larger, unified data warehouse.

Working of Bottom-Up Approach

 Department-Specific Data Marts: The process starts with creating data marts for individual
departments or specific business functions. These data marts are designed to meet immediate
data analysis and reporting needs, allowing departments to gain quick insights.

 Integration into a Data Warehouse: Over time, these data marts are connected and
consolidated to create a unified data warehouse. The integration ensures consistency and
provides a comprehensive view of the organization’s data.
Advantages of Bottom-Up Approach

1. Faster Report Generation: Since data marts are created first, reports can be generated quickly,
providing immediate value to the organization. This enables faster insights and decision-making.

2. Incremental Development: This approach supports incremental development by allowing the


creation of data marts one at a time. Organizations can achieve quick wins and gradually improve
data reporting and analysis over time.

3. User Involvement: The Bottom-Up Approach encourages active involvement from business users
during the design and implementation process. Users can provide feedback on data marts and reports,
ensuring the solution meets their specific needs.

4. Flexibility: This approach is highly flexible, as data marts are designed based on the unique
requirements of specific business functions. It is particularly beneficial for organizations that require
dynamic and customizable reporting and analysis.

5. Faster Time to Value: With quicker implementation compared to the Top-Down Approach, the
Bottom-Up Approach delivers faster time to value. This is especially useful for smaller organizations
with limited resources or businesses looking for immediate results.

6. Reduced Risk: By creating and refining individual data marts before integrating them into a larger
data warehouse, this approach reduces the risk of failure. It also helps identify and resolve data
quality issues early in the process.
7. Scalability: The Bottom-Up Approach is scalable, allowing organizations to add new data marts as
needed. This makes it an ideal choice for businesses experiencing growth or undergoing significant
change.

8. Clarified Data Ownership: Each data mart is typically owned and managed by a specific business
unit, which helps clarify data ownership and accountability. This ensures data accuracy, consistency,
and proper usage across the organization.

9. Lower Cost and Time Investment: Compared to the Top-Down Approach, the Bottom-Up
Approach requires less upfront cost and time to design and implement. This makes it an attractive
option for organizations with budgetary or time constraints.

Disadvantage of Bottom-Up Approach

1. Inconsistent Dimensional View: Unlike the Top-Down Approach, the Bottom-Up Approach may
not provide a consistent dimensional view of data marts. This inconsistency can lead to variations in
reporting and analysis across departments.

2. Data Silos: This approach can result in the creation of data silos, where different business units
develop their own data marts independently. This lack of coordination may cause redundancies, data
inconsistencies, and difficulties in integrating data across the organization.

3. Integration Challenges: Integrating multiple data marts into a unified data warehouse can be
challenging. Differences in data structures, formats, and granularity may lead to issues with data
quality, accuracy, and consistency.

4. Duplication of Effort: In a Bottom-Up Approach, different business units may inadvertently


duplicate efforts by creating data marts with overlapping or similar data. This can result in
inefficiencies and increased costs in data management.

5. Lack of Enterprise-Wide View: Since data marts are typically designed to meet the needs of
specific departments, this approach may not provide a comprehensive, enterprise-wide view of data.
This limitation can hinder strategic decision-making and limit an organization’s ability to analyze
data holistically.

6. Complexity in Management: Managing and maintaining multiple data marts with varying
complexities and granularities can be more challenging compared to a centralized data warehouse.
This can lead to higher maintenance efforts and potential difficulties in ensuring long-term
scalability.

7. Risk of Inconsistency: The decentralized nature of the Bottom-Up Approach increases the risk of
data inconsistency. Differences in data structures and definitions across data marts can make it
difficult to compare or combine data, reducing the reliability of reports and analyses.

8. Limited Standardization: Without a central repository to enforce standardization, the Bottom-Up


Approach may lack uniformity in data formats and definitions. This can complicate collaboration and
integration across departments.

Types of Data Warehouse Architecture

Architecture Type Description When to Use


1. Centralized All enterprise data is stored in a single, When full integration and consistency
Data Warehouse integrated warehouse. are critical across the organization.
Separate data marts are created for When each department operates
2. Independent
individual departments without a central independently and integration is not
Data Marts
warehouse. needed.
A central data warehouse feeds When enterprise-wide consistency is
3. Hub-and-Spoke
dependent data marts tailored to needed, but local customization is also
Architecture
different business units. important.
A collection of integrated data marts When quick departmental
4. Data Mart Bus
built around conformed dimensions implementations are desired, with
Architecture
(Kimball approach). future integration in mind.
Existing data warehouses and marts are
5. Federated When systems are already in place, and
integrated virtually without physically
Architecture minimal changes are preferred.
combining the data.

Summary of Key Characteristics

Independent Hub-and- Bus


Feature Centralized Federated
Marts Spoke Architecture
Data Integration High Low High Medium to High Variable
Time to Implement Long Short Medium Medium Short
Scalability Moderate High High High High
Maintenance
High Low Moderate Moderate High
Complexity
Data Redundancy Low High Medium Low High
Data warehouse architectural types

Three-Tier Architecture of Data Warehouse

The Three/Multi-Tier Data Warehouse Architecture is widely adopted due to its clear and organized
framework. This architecture divides data handling into three main layers:

 Bottom Tier (Data Sources and Data Storage)

 Middle Tier (OLAP Engine)

 Top Tier (Front-End Tools)


a.Bottom Tier
The bottom tier is the Data Source Layer, and it plays a foundational role in the three-tier data
warehouse architecture. It is responsible for collecting, preparing, and feeding data into the
warehouse. The bottom tier is not for analysis, but rather for preparing data to be useful for analytics.
It ensures that only high-quality, consistent, and integrated data is passed to the middle tier for
storage and retrieval.
Functions of the Bottom Tier

Function Description
Pulls raw data from multiple sources: – Operational databases (OLTP)–
1. Data Extraction
Legacy systems– External sources (e.g., web data, spreadsheets)
2. Data Cleansing Fixes inconsistencies, removes duplicates, and standardizes values
3. Data Converts data into a consistent format: – Data type conversion– Aggregation–
Transformation Key generation
4. Data Integration Merges data from different sources into a unified schema
Temporarily stores cleaned and transformed data before loading it into the
5. Data Staging
warehouse
6. Load into
Sends the prepared data to the middle tier (data warehouse repository)
Warehouse

b.Middle Tier
The Middle Tier—also called the Data Storage Layer—is the core of the data warehouse architecture.
It is where the cleansed, integrated, and transformed data is stored and managed to support efficient
retrieval for analysis and reporting. The Middle Tier in the three-tier architecture of a data warehouse
is where the OLAP (Online Analytical Processing) server resides. This tier acts as the critical
processing layer that manages and enables complex analytical queries on data stored in the bottom
tier. It functions as a mediator between the data repository (bottom tier) and the end-user interface
(top tier), ensuring that data is efficiently processed and made ready for analysis and reporting.

Key Functions of the Middle Tier

Function Description
Stores the processed data from the bottom tier in a structured form (e.g.,
1. Data Storage
fact and dimension tables)
Organizes data using warehouse schemas (e.g., Star, Snowflake) for
2. Schema Management
efficient querying
3. Metadata Stores information about the data: source, structure, transformation logic,
Management update frequency, etc.
Supports pre-aggregation for performance (e.g., weekly/monthly
4. Data Aggregation
summaries)
5. History Management Maintains historical data to support time-variant analysis
Enables fast query performance via indexing, partitioning, and
6. Query Optimization
materialized views
7. Security & Access
Manages user permissions and ensures data integrity and confidentiality
Control

C.TOP Tier
The Top Tier is the Presentation or Information Delivery Layer of a data warehouse. It is the user-
facing part of the architecture where business users, analysts, and decision-makers interact with the
data. To deliver strategic information to end-users in a form they can easily analyze, explore, and use
for decision-making. The Top Tier is all about making data actionable. It converts stored data into
visual, analytical, and decision-support outputs for stakeholders at all levels.

Function Description
Allows users to run predefined or ad hoc queries to retrieve specific
1. Query & Reporting
information
2. Data Analysis & Enables OLAP (Online Analytical Processing) for slicing, dicing,
Exploration drilling down, and pivoting data
Provides dashboards, charts, graphs, and scorecards for quick insight and
3. Visualization
trend monitoring
4. Business Intelligence Offers interfaces like Tableau, Power BI, SAP BO, or Excel for intuitive
Tools interaction
Function Description
5. Alerting & Sends alerts or triggers based on thresholds (e.g., drop in yield, profit
Notifications below target)
Empowers users to access data without IT support (through user-friendly
6. Self-Service Access
GUIs)
Helps executives and managers make informed strategic or tactical
7. Decision Support
business decisions

Autonomous Data Warehouse

An Autonomous Data Warehouse (ADW) is a cloud-based data warehouse service that automates the
provisioning, tuning, scaling, patching, backup, and security of the data warehouse environment
without human intervention.

Key features include:

 Self-Driving: Automatically manages performance tuning, indexing, and query optimization.


 Self-Securing: Protects data by automatically applying security patches, encryption, and
threat detection.
 Self-Repairing: Automatically detects failures and recovers without downtime.
 Elastic Scaling: Automatically scales compute and storage resources based on workload
demand.
 Cloud-Native: Delivered as a fully managed service in the cloud, eliminating infrastructure
management.

Why use an Autonomous Data Warehouse?

 Reduced administration effort: No need for manual tuning, patching, or backup tasks.
 Faster performance: AI-driven optimization adapts continuously for best query speed.
 High availability: Automatic failover and repair keep the warehouse online.
 Cost efficiency: Pay only for what you use, scale resources on-demand.
 Better security: Continuous monitoring and automatic application of security best practices.

Examples of Autonomous Data Warehouses

 Oracle Autonomous Data Warehouse: One of the earliest and most widely known ADW
services.
 Amazon Redshift Serverless: Offers some autonomous features with automatic scaling and
management.
 Google BigQuery: Serverless, highly scalable data warehouse with automatic resource
management.
 Microsoft Azure Synapse Analytics: Provides managed analytics with automated workload
optimization.
Architecture of Autonomous Data Warehouse

Core components:

 Data Storage Layer: Stores all data in optimized, often columnar, formats for fast analytics.
 Compute Layer: Dedicated compute resources (CPU, memory) that execute queries; can
scale independently.
 Control Plane: The brain that orchestrates automation tasks like provisioning, patching,
tuning, backup, and security.
 AI & Machine Learning Engine: Continuously monitors workloads, performance, and
system health to automate optimization and self-healing.
 Security & Compliance Layer: Automates encryption, access control, auditing, and threat
detection.

How it works:

 When you load data or run queries, the system dynamically allocates resources as needed.
 The AI engine automatically adjusts indexing, caching, and query plans for optimal speed.
 System health and security are continuously monitored with automatic updates and repairs.
 Users interact via SQL, BI tools, or data pipelines without worrying about infrastructure.

Benefits of Autonomous Data Warehouse

a. Ease of Use

 No manual tuning or complex configuration.


 Simplified management through intuitive dashboards and APIs.

b. Performance

 AI-powered query optimization adapts to workload patterns.


 Parallel processing and resource scaling ensure fast response times.

c. Reliability and Availability

 Automated failover and backup mean minimal downtime.


 Continuous monitoring detects and fixes issues proactively.

d. Security

 Automatic encryption of data at rest and in transit.


 Auto-patching of security vulnerabilities without interrupting service.
 Role-based access control and audit trails are enforced automatically.

e. Cost Efficiency

 Pay-as-you-go pricing models.


 Elastic scaling lets you increase or decrease resources based on demand, avoiding
overprovisioning.
Oracle Autonomous Data Warehouse

Autonomous Data Warehouse Vs Snowflake

Snowflake is a modern, cloud-based data warehousing platform designed to handle large-scale data
analytics with ease. It provides a fully managed service that separates compute and storage, allowing
for flexible, scalable, and fast processing of data.

How Snowflake Works:

 Data is stored in a centralized storage layer.


 Compute resources (called virtual warehouses) run queries and can be scaled independently.
 Multiple virtual warehouses can run concurrently on the same data without contention.
 Users interact with Snowflake using SQL queries through the web interface, BI tools, or
applications.

The Snowflake architecture is composed of three key layers, each with specific components and
responsibilities. Here's a breakdown of the main components of Snowflake architecture as described
in your document:

1. Storage Layer

This layer manages how data is stored in Snowflake.

 Elastic Storage: Scales independently from compute; adjusts to varying data volumes.
 Cloud-Based Object Storage: Uses cloud storage (e.g., AWS S3, Azure Blob, GCP) for
storing all structured and semi-structured data.
 Data Clustering with Micro-Partitions: Automatically divides data into micro-partitions for
efficient access.
 Zero Copy Cloning: Enables creation of copies of data without physically duplicating it,
saving space and time.

2. Query Processing Layer (Compute Layer)


Handles query execution using compute resources (virtual warehouses).

 Automatic Query Processing: Dynamically optimizes and rewrites SQL queries for
efficiency.
 Parallel Execution: Executes queries across multiple compute clusters using MPP
(Massively Parallel Processing).
 On-Demand Resource Allocation: Compute resources are provisioned dynamically based on
query needs.
 Separation of Compute and Storage: Enhances scalability and cost efficiency by
decoupling them.

3. Cloud Services Layer (Control Layer)

Acts as the brain of the architecture, coordinating all activities and services.

 Metadata Management: Manages data definitions, query statistics, and access paths.
 Authentication & Access Control: Provides secure access with MFA and role-based
permissions.
 Query Optimization: Improves performance with dynamic execution plan generation.
 Infrastructure Management: Automatically handles scaling and resource allocation.
 Security: Implements end-to-end encryption, data masking, and access controls.

Snowflake Architecture

Key Features of Snowflake:

 Cloud-Native: Built from the ground up for the cloud, running on major cloud providers like
AWS, Microsoft Azure, and Google Cloud Platform.
 Separation of Compute and Storage: Storage is centralized and elastic, while compute
resources (called "virtual warehouses") can be spun up or down independently based on
workload.
 Scalability: Instantly scale compute power up or down without downtime.
 Zero Management: No infrastructure to manage. Snowflake handles backups, tuning,
updates, and availability automatically.
 Multi-Cluster Architecture: Multiple compute clusters can access the same data
simultaneously, supporting high concurrency without performance loss.
 Support for Structured and Semi-Structured Data: Easily query JSON, Avro, Parquet, and
XML data using SQL.
 Data Sharing: Securely share live data with other Snowflake users or external organizations
without moving or copying data.
 Security: End-to-end encryption, multi-factor authentication, role-based access control, and
compliance with many industry standards.

Why Use Snowflake?

 Flexibility: Pay only for what you use — compute and storage are billed separately.
 Ease of Use: Simplifies data management with automatic tuning and no need for hardware
provisioning.
 Performance: Handles many concurrent users and large volumes of data smoothly.
 Data Collaboration: Easily share data securely across teams and organizations.

Autonomous Data Warehouse Vs Snowflake

Autonomous Data Warehouse


Feature/Aspect Snowflake
(Oracle ADW)
Cloud-based, fully autonomous data
Type Cloud-native data warehouse platform
warehouse
Highly automated: self-driving Automated scaling and tuning, but some
Automation
(tuning), self-securing, self-repairing user control remains
Primarily Oracle Cloud Infrastructure
Cloud Provider Multi-cloud (AWS, Azure, GCP)
(OCI)
Separation of compute and storage; Separation of compute and storage;
Architecture
AI-driven optimization architecture designed for elasticity
Automatic scaling of compute and Instant, elastic scaling of compute
Scalability
storage clusters
Performance Fully automated query optimization Automatic clustering and optimization,
Tuning and indexing user can configure clustering keys
Built-in automatic encryption, Encryption by default; extensive role-
Security
patching, and threat detection based access and data masking
Supports data sharing within Oracle Native secure data sharing across
Data Sharing
ecosystem accounts and organizations
Minimal admin effort; AI handles User-friendly, simple SQL interface, but
Ease of Use
tuning and maintenance some manual tuning possible
Pay for compute and storage Consumption-based with per-second
Pricing Model
separately; auto-scaling billing; separate compute warehouses
Integration Strong integration with Oracle tools, Broad integrations with BI tools, data
Ecosystem ML, and Analytics lakes, and cloud services
Enterprise workloads needing high Flexible workloads including data
Use Cases
automation and security engineering, analytics, and sharing
Backup & Automated backups and recovery Time Travel and Fail-safe features for
Autonomous Data Warehouse
Feature/Aspect Snowflake
(Oracle ADW)
Recovery managed by Oracle data recovery
More user control over clusters,
User Control Mostly hands-off due to autonomy
warehouses, and tuning

Modern Data Warehouse

A Modern Data Warehouse is an evolved version of traditional data warehouses designed to handle
today’s diverse, large-scale, and fast-changing data environments. It leverages cloud computing,
advanced data integration, and analytics technologies to deliver more flexibility, scalability, and real-
time insights.

Key Characteristics of a Modern Data Warehouse:

1. Cloud-Native and Scalable: Built on cloud platforms (AWS, Azure, Google Cloud) enabling
elastic storage and compute that scale independently based on demand.
2. Separation of Storage and Compute: Storage and processing resources are decoupled so
you can scale each independently, reducing costs and improving performance.
3. Support for Diverse Data Types: Can handle structured data (tables), semi-structured data
(JSON, XML, Avro), and unstructured data (images, logs).
4. Real-Time and Batch Processing: Supports both batch ETL and real-time data ingestion,
enabling up-to-date analytics.
5. Advanced Analytics and AI Integration: Integrates easily with machine learning
frameworks and BI tools for predictive analytics and data-driven decisions.
6. Data Lake Integration: Often combined with data lakes to create a data lakehouse
architecture, where raw and curated data coexist and are accessible.
7. Self-Service and Democratization: Provides easy-to-use interfaces and tools so business
users, analysts, and data scientists can access data without heavy IT involvement.
8. Automated Management and Governance: Employs automation for tasks like data
cataloguing, lineage tracking, security, and compliance.

Why Modern Data Warehousing?

 Agility: Quickly adapt to changing business needs and data sources.


 Cost Efficiency: Pay-as-you-go cloud models reduce upfront investment.
 Performance: Fast querying and analytics over large, complex datasets.
 Collaboration: Easier data sharing within and across organizations.
 Future-Proofing: Built to handle AI, IoT, and big data workloads.
Popular Modern Data Warehouse Solutions

 Snowflake
 Google BigQuery
 Amazon Redshift
 Microsoft Azure Synapse Analytics
 Oracle Autonomous Data Warehouse

Traditional Data Warehouse Modern Data Warehouse


Rigid, hardware-dependent, costly to
Flexible, cloud-native, cost-efficient and scalable
scale
Supports a variety of data types (structured,
Limited data types, mainly structured
semi/unstructured)
Manual management and tuning Automated, AI-driven management and tuning
IT-centric access model Self-service, democratized data access
Slower to deploy and adapt Rapid provisioning and agile to changing business needs

Note :You have write standard data warehousing components working process when 16 mark
question is asked.

You might also like