0% found this document useful (0 votes)
59 views4 pages

DW Notes

The document provides an overview of data warehousing, detailing its architecture, key components, and characteristics, including ETL processes and metadata management. It distinguishes between data warehouses, data lakes, and data marts, while also discussing the benefits, challenges, and best practices for implementation. Additionally, it outlines various types of data warehouses and tools available for effective data management and analysis.

Uploaded by

Livia Swift
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
59 views4 pages

DW Notes

The document provides an overview of data warehousing, detailing its architecture, key components, and characteristics, including ETL processes and metadata management. It distinguishes between data warehouses, data lakes, and data marts, while also discussing the benefits, challenges, and best practices for implementation. Additionally, it outlines various types of data warehouses and tools available for effective data management and analysis.

Uploaded by

Livia Swift
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 4

INTRODUCTION TO DATA WAREHOUSING AND COMPONENTS OF DATA WAREHOUSE ARCHITECTURE

MANAGEMENT ETL (Extract, Transform, Load)


− moving of data from a data source into data warehouse
Data Warehouse/Enterprise Data Warehouse (EDW) − converts data into a usable format so that once it’s in
− system that aggregates data from different the data warehouse, it can be analyzed/queried/etc.
(heterogeneous) sources into a single, central, consistent
data store to support data analysis, data mining, artificial Metadata
intelligence (AI), and machine learning. − data about data
− enables an organization to run powerful analytics on huge − describes all of the data that are stored in a system to
volumes (petabytes) of historical data in ways that a make it searchable (authors, dates or locations of an
standard database cannot. article, create date of a file, the size of a file, etc)
− core of the BI system which is built for data analysis and − allows data to be organized to make it usable, so it
reporting. can be analyzed to create dashboards and reports

KEY CHARACTERISTICS OF DATA WAREHOUSE SQL Query Processing


1. Subject-Oriented − SQL is the de facto standard language for querying
- it provides topic-wise information rather than a data; the language that analysts use to pull out
business's overall processes (sales, promotion, inventory, insights from their data stored in the data warehouse.
etc.)
- ex: analyzing a company’s sales data needs a data Data Layer
warehouse that concentrates on sales. − the access layer that allows users to actually get to
the data
2. Integrated − typically where data mart is.
- developed by integrating data from varied sources into a − partitions segments of your data out depending on
consistent format. who you want to give access to, so you can get very
- the data must be stored in the warehouse in a consistent granular across your organization
and universally acceptable manner in terms of naming,
format, and coding for effective analysis. Governance and Security
− related to the data layer in that you need to be able
3. Time-Variant to provide fine-grained access and security policies
- data stored in a data warehouse is documented with an across all of your organization’s data
element of time (explicitly or implicitly)
- ex: the Primary Key, which must have an element of time DATA WAREHOUSE SYSTEM
like the day, week, or month. • Decision Support System (DSS)
• Executive Information System
4. Non-Volatile • Management Information System
- data once entered into a data warehouse must remain • Business Intelligence Solution
unchanged thus all data is read-only. • Analytic Application
- previous data is not erased when current data is entered • Data Warehouse
helping in analyzing what has happened and when.
HOW DATA WAREHOUSE WORKS?
A SHORT HISTORY OF DATA WAREHOUSE ARCHITECTURE − It works as a central repository where information
 built around a relational database system, either on- arrives from one or more data sources.
premise or in the cloud, where data is both stored and − data flows into a data warehouse from the
processed. transactional system and other relational databases
 other components would include metadata − data is processed, transformed, and ingested so that
management and an API connectivity layer allowing the users can access the processed data in the Data
warehouse to pull data from organizational sources and Warehouse through Business Intelligence tools, SQL
provide access to analytics and visualization tools. clients, and spreadsheets
 typical data warehouse has four main components: a − data warehouse merges information coming from
central database, ETL tools, metadata, and access tools. different sources into one comprehensive database
 born in the 1980s, it addressed the need for optimized so an organization can analyze its customers more
analytics on data. As companies’ business applications holistically while considering all the information
began to grow and generate/store more data, they available.
needed a system that could both manage the data and − data warehousing makes data mining possible.
analyze it. − data may be:
 at a high level, database admins could pull data from  Structured - generally stored in tables in the form of rows
their operational systems and add a schema to it via and columns (relational data)
transformation before loading it into their data Semi-Structured - organized up to some extent
warehouse only and the rest is unstructured (XML/RDF)
 metadata became important when data warehouse  Unstructured - unprocessed and unorganized data (Text
architecture evolved and grew in popularity, more files, Emails, Media logs )
people within a company started using it to access data–
− data warehousing makes data mining (looking for
and the data warehouse made it easy to do so with
patterns in the data that may lead to higher sales and
structured data
profits) possible
 reporting and dashboarding became a key use case, and
SQL (structured query language) became the de facto
way of interacting with that data.
DATA WAREHOUSE VS. DATABASE, DATA LAKE, AND DATA TYPES OF DATA WAREHOUSES
MART Cloud Data Warehouse
− specifically built to run in the cloud, and it is offered
Data Warehouse to customers as a managed service
– gathers raw data from multiple sources into a central − seek to reduce on-premises data center footprint
repository, structured using predefined schemas − the physical data warehouse infrastructure is
designed for data analytics managed by the cloud company

Data Lake Data Warehouse Software (on-premise/license)


– centralized repository designed to store, process, and – business can purchase a data warehouse license and
secure large amounts of structured, semi-structured, and then deploy a data warehouse on their own on-
unstructured data (without predefined schemas) premises infrastructure
– can store data in its native format and process any variety – expensive but might be a better choice for
of it, ignoring size limits; compared to a hierarchical data government entities, financial institutions, or other
warehouse, which stores data in files or folders, a data organizations that want more control over their data
lake uses a flat architecture and object storage to store or need to comply with strict security or data privacy
the data standards or regulations
– commonly built on big data platforms such as Apache Data Warehouse Appliance
Hadoop – pre-integrated bundle of hardware and software—
CPUs, storage, operating system, and data warehouse
software—that a business can connect to its network
o designed to capture raw data (structured, semi- and start using as-is
structured, and unstructured) – sits somewhere between cloud and on-premises
o made for large amounts of data implementations in terms of upfront cost, speed of
o used for ML and AI in its current state or for deployment, ease of scalability, and management
analytics with processing control
o can organize and put into databases or DW
BENEFITS OF A DATA WAREHOUSE
1. Better data quality: A data warehouse centralizes data
Data Mart from a variety of data sources, such as transactional
– subset of a data warehouse that contains data specific to systems, operational databases, and flat files. It then
a particular business line or department cleanses it, eliminates duplicates, and standardizes it to
– enable a department or business line to discover more- create a single source of the truth.
focused insights more quickly than possible when 2. Faster, business insights: Data warehouses enable data
working with the broader data warehouse data set integration, allowing business users to leverage all of a
– built from an existing data warehouse (or other data company’s data into each business decision.
sources) through a complex procedure that involves 3. Smarter decision-making: Data warehouse supports large-
multiple technologies and tools to design and construct scale BI functions such as data mining (finding unseen
a physical database, populate it with data, and set up patterns and relationships in data), artificial intelligence,
intricate access and management protocols and machine learning.
4. Gaining and growing competitive advantage: All of the
Database above combine to help an organization finding more
– an organized collection of structured information, or opportunities.
data, typically stored electronically in a computer system
– usually controlled by a database management system CHALLENGES WITH DATA WAREHOUSE ARCHITECTURE
(DBMS) • companies start housing more data and needing more
– used to store and manage large amounts of unstructured advanced analytics and a wide range of data, the data
data structured and , and they can be used to support a warehouse starts to become expensive and not so flexible
wide range of activities, including data storage, data • open data lakehouse allows you to run warehouse
analysis, and data management workloads on all kinds of data in an open and flexible
architecture (instead of a tightly coupled system it is much
❖ Relational Database more flexible and also can manage unstructured and semi-
o designed to capture and record data (OLTP) structured data like photos, videos, IoT data etc)
o live, real-time data • data lakehouse can also support your data science, ML
o data stored in tables with rows and columns and AI workloads in addition to your reporting and
o data is highly detailed dashboarding workloads (upgrade from data warehouse
o flexible schema (how data is organized) architecture, then developing an open data lakehouse is
the way to go)

TYPES OF DATA WAREHOUSE


1. ENTERPRISE DATA WAREHOUSE (EDW)
– a centralized warehouse that provides decision
support service across the enterprise
– also provide the ability to classify data according
to the subject and give access according to
divisions
2. OPERATIONAL DATA STORE
– nothing but data store required when neither
data warehouse nor OLTP systems support
organizations reporting needs
WHAT IS A DATA WAREHOUSE USED FOR?
3. DATA MART
✓ Airline
– subset of the data warehouse
✓ Banking
– specially designed for a particular line of
✓ Healthcare
business, such as sales, finance, sales or finance.
✓ Public Sector
Data can collect directly from sources
✓ Investment and insurance sector
✓ Retain chain
STAGES OF USE OF THE DATA WAREHOUSE
✓ Telecommunication
➢ OFFLINE OPERATIONAL DATABASE
✓ Hospitality industry
– data is just copied from an operational system to
another system (loading, processing, and reporting
STEPS TO IMPLEMENT DATA WAREHOUSE
of the copied data do not impact the operational
a) ENTERPRISE STRATEGY
system’s performance)
Identify technical including current architecture and tools
as well as facts, dimensions, and attributes (data mapping
➢ OFFLINE DATA WAREHOUSE
and transformation)
– data in the data warehouse is regularly updated from
the operational database
b) PHASED DELIVERY
– data in data warehouse is mapped and transformed
Data warehouse implementation should be phased based
to meet the data warehouse objectives
on subject areas. Related business entities like booking
and billing should be first implemented and then
➢ REAL TIME DATA WAREHOUSE
integrated with each other
– data warehouses are updated whenever any
transaction takes place in operational database c) INTERATIVE PROTOTYPING
– ex: Airline or railway booking system Data warehouse should be developed and tested
interatively
➢ INTEGRATED DATA WAREHOUSE
– data warehouses are updated continuously when
the operational system performs a transaction (data BEST PRACTICES TO IMPLEMENT A DATA WAREHOUSE
warehouse then generates transactions which are o The data warehouse must be well integrated, well defined
passed back to the operational system) and time stamped.
o While designing data warehouse make sure you use right
FOUR COMPONENTS OF DATA WAREHOUSE tool, stick to life cycle, take care about data conflicts and
Load Manager (the front components) ready to learn your mistakes.
– It performs with all the operations associated with o Never replace operational systems and reports.
the extraction and load of data into the warehouse. o Don’t spend too much time on extracting, cleaning and
– It prepares the data for entering into the DW loading data.
o Ensure to involve all stakeholders including business
Warehouse Manager personnel in data warehouse implementation process.
– performs operations associated with the o Prepare a training plan for the end users.
management of the data in the warehouse.
– It performs operations like analysis of data to ensure ADVANTAGES OF DATA WAREHOUSE
consistency, creation of indexes and views, ▪ Allows business users to quickly access critical data from
generation of denormalization and aggregations, some sources all in one place.
transformation and merging of source data and ▪ Provides consistent information on various cross-
archiving and back up data functional activities.
▪ Helps to integrate many sources of data to reduce stress
Query Manager (backend component) on the production systems.
– performs all the operations related to the ▪ Helps reduce total turnaround time for analysis and
management of user queries. reporting.
▪ Restructuring and integration make it easier for the user
End-User Access Tools to used for reporting and analysis.
– categorized into five different groups: ▪ Stores a large amount of historical data which helps users
1. Data reporting to analyze different time periods and trends to make
2. Query Tools future predictions
3. Application Development tools
4. Executive Information System (EIS) tools DISADVANTAGES OF DATA WAREHOUSE
5. OLAP  Not an ideal option for unstructured data.
 Creation and implementation of DW is surely time
WHO NEEDS DATA WAREHOUSE? confusing affair.
• Decision makers who rely on mass amount of data  Can be outdated relatively quickly.
• Users who use customized, complex processes to obtain  Difficult to make changes in data types and ranges, data
information from multiple data sources source schema, indexes, and queries.
• People who want simple technology to access the data  May seem easy, but actually, it is too complex for the
• People who want a systematic approach for making average users.
decision  Sometimes warehouse users will develop different
• Users who want a huge amount of data which is a business rules
necessity for reports, grids or charts
• DW is a first step if you want to discover ‘hidden patterns’
or data-flows and groupings
DATA WAREHOUSE TOOLS
MarkLogic
- useful data warehousing solution that makes data
integration easier and faster using an array of
enterprise. This tool helps to perform very complex
search operations. It can query different types of data
like documents, relationships, and metadata

Oracle
- the industry-leading database. If offers a wide range of
choice of data warehouse solutions for both on-
premises and in the cloud

Amazon Redshift
- a simple and cost-effective tool to analyze all types of
data using standard SQL and existing BI tools

You might also like