What is a Data Warehouse?
A Data Warehouse (DW) is a relational database that is designed for query and analysis rather than
transaction processing. It includes historical data derived from transaction data from single and
multiple sources.
A Data Warehouse provides integrated, enterprise-wide, historical data and focuses on providing
support for decision-makers for data modeling and analysis.
Data Warehouse is a relational database management system (RDBMS) construct to meet the
requirement of transaction processing systems. It can be loosely described as any centralized data
repository which can be queried for business benefits. It is a database that stores information
oriented to satisfy decision-making requests. It is a group of decision support technologies, targets to
enabling the knowledge worker (executive, manager, and analyst) to make superior and higher
decisions. So, Data Warehousing support architectures and tool for business executives to
systematically organize, understand and use their information to make strategic decisions.
Data Warehouse environment contains an extraction, transportation, and loading (ETL) solution, an
online analytical processing (OLAP) engine, customer analysis tools, and other applications that
handle the process of gathering information and delivering it to business users.
● A data warehouse is a database, which is kept separate from the organization's operational
database.
● There is no frequent updating done in a data warehouse.
● It possesses consolidated historical data, which helps the organization to analyze its business.
● A data warehouse helps executives to organize, understand, and use their data to take
strategic decisions.
● Data warehouse systems help in the integration of diversity of application systems.
● A data warehouse system helps in consolidated historical data analysis.
A Data Warehouse can be viewed as a data system with the following attributes:
● It is a database designed for investigative tasks, using data from various applications.
● It supports a relatively small number of clients with relatively long interactions.
● It includes current and historical data to provide a historical perspective of information.
● Its usage is read-intensive.
● It contains a few large tables.
"Data Warehouse is a subject-oriented, integrated, and time-variant store of information in support of
management's decisions."
Data Warehouse Features
The key features of a data warehouse are discussed below −
● Subject Oriented − A data warehouse is subject oriented because it provides information
around a subject rather than the organization's ongoing operations. These subjects can be
product, customers, suppliers, sales, revenue, etc. A data warehouse does not focus on the
ongoing operations, rather it focuses on modelling and analysis of data for decision making.
● Integrated − A data warehouse is constructed by integrating data from heterogeneous sources
such as relational databases, flat files, etc. This integration enhances the effective analysis of
data.
● Time Variant − The data collected in a data warehouse is identified with a particular time
period. The data in a data warehouse provides information from the historical point of view.
● Non-volatile − Non-volatile means the previous data is not erased when new data is added to
it. A data warehouse is kept separate from the operational database and therefore frequent
changes in operational database is not reflected in the data warehouse.
Data Warehouse Applications
As discussed before, a data warehouse helps business executives to organize, analyze, and use
their data for decision making. A data warehouse serves as a sole part of a plan-execute-assess
"closed-loop" feedback system for the enterprise management. Data warehouses are widely used in
the following fields −
● Financial services
● Banking services
● Consumer goods
● Retail sectors
● Controlled manufacturing
Types of Data Warehouse
Information processing, analytical processing, and data mining are the three types of data
warehouse applications that are discussed below −
● Information Processing − A data warehouse allows to process the data stored in it. The data
can be processed by means of querying, basic statistical analysis, reporting using crosstabs,
tables, charts, or graphs.
● Analytical Processing − A data warehouse supports analytical processing of the information
stored in it. The data can be analyzed by means of basic OLAP operations, including
slice-and-dice, drill down, drill up, and pivoting.
● Data Mining − Data mining supports knowledge discovery by finding hidden patterns and
associations, constructing analytical models, performing classification and prediction. These
mining results can be presented using the visualization tools.
Sr.No. Data Warehouse (OLAP) Operational Database(OLTP)
1 It involves historical processing of It involves day-to-day processing.
information.
2 OLAP systems are used by OLTP systems are used by clerks, DBAs, or
knowledge workers such as database professionals.
executives, managers, and
analysts.
3 It is used to analyze the business. It is used to run the business.
4 It focuses on Information out. It focuses on Data in.
5 It is based on Star Schema, It is based on Entity Relationship Model.
Snowflake Schema, and Fact
Constellation Schema.
6 It focuses on Information out. It is application oriented.
7 It contains historical data. It contains current data.
8 It provides summarized and It provides primitive and highly detailed data.
consolidated data.
9 It provides summarized and It provides detailed and flat relational view of data.
multidimensional view of data.
10 The number of users is in hundreds. The number of users is in thousands.
11 The number of records accessed is in The number of records accessed is in tens.
millions.
12 The database size is from 100GB to The database size is from 100 MB to 100 GB.
100 TB.
13 These are highly flexible. It provides high performance.
Need for Data Warehouse
1. 1) Business User: Business users require a data warehouse to view summarized data from the
past. Since these people are non-technical, the data may be presented to them in an
elementary form.
2. 2) Store historical data: Data Warehouse is required to store the time variable data from the
past. This input is made to be used for various purposes.
3. 3) Make strategic decisions: Some strategies may be depending upon the data in the data
warehouse. So, data warehouse contributes to making strategic decisions.
4. 4) For data consistency and quality: Bringing the data from different sources at a
commonplace, the user can effectively undertake to bring the uniformity and consistency in
data.
5. 5) High response time: Data warehouse has to be ready for somewhat unexpected loads and
types of queries, which demands a significant degree of flexibility and quick response time.
Benefits of Data Warehouse
1. Understand business trends and make better forecasting decisions.
2. Data Warehouses are designed to perform well enormous amounts of data.
3. The structure of data warehouses is more accessible for end-users to navigate, understand,
and query.
4. Queries that would be complex in many normalized databases could be easier to build and
maintain in data warehouses.
5. Data warehousing is an efficient method to manage demand for lots of information from lots
of users.
6. Data warehousing provide the capabilities to analyze a large amount of historical data.
Difference between Operational Database and Data
Warehouse
Operational Database Data Warehouse
Operational systems are designed to support Data warehousing systems are typically
high-volume transaction processing. designed to support high-volume analytical
processing (i.e., OLAP).
Operational systems are usually concerned with Data warehousing systems are usually
current data. concerned with historical data.
Data within operational systems are mainly Non-volatile, new data may be added regularly.
updated regularly according to need. Once Added rarely changed.
It is designed for real-time business dealing and It is designed for analysis of business
processes. measures by subject area, categories, and
attributes.
It is optimized for a simple set of transactions, It is optimized for extent loads and high,
generally adding or retrieving a single row at a complex, unpredictable queries that access
time per table. many rows per table.
It is optimized for validation of incoming Loaded with consistent, valid information,
information during transactions, uses validation requires no real-time validation.
data tables.
It supports thousands of concurrent clients. It supports a few concurrent clients relative to
OLTP.
Operational systems are widely process-oriented. Data warehousing systems are widely
subject-oriented
Operational systems are usually optimized to Data warehousing systems are usually
perform fast inserts and updates of associatively optimized to perform fast retrievals of
small volumes of data. relatively high volumes of data.
Data In Data Out
Less Number of data accessed. Large Number of data accessed.
Relational databases are created for on-line Data Warehouse designed for on-line
transactional Processing (OLTP) Analytical Processing (OLAP)
Data Warehouse Architecture
A data warehouse architecture is a method of defining the overall architecture of data communication
processing and presentation that exist for end-clients computing within the enterprise. Each data
warehouse is different, but all are characterized by standard vital components.
Production applications such as payroll accounts payable product purchasing and inventory control
are designed for online transaction processing (OLTP). Such applications gather detailed data from
day to day operations.
Data Warehouse Staging Area is a temporary location where a record from source systems is copied.
Architecture is the proper arrangement of the elements. We build a data warehouse with software and
hardware components. To suit the requirements of our organizations, we arrange these building we
may want to boost up another part with extra tools and services. All of these depends on our
circumstances.
A data warehouses uses a staging area (A place where data is processed before entering the
warehouse).
Types of Data Warehouse Architectures
Single-Tier Architecture
Single-Tier architecture is not periodically used in practice. Its purpose is to minimize the amount of
data stored to reach this goal; it removes data redundancies.
The figure shows the only layer physically available is the source layer. In this method, data
warehouses are virtual. This means that the data warehouse is implemented as a multidimensional
view of operational data created by specific middleware, or an intermediate processing layer.
The vulnerability of this architecture lies in its failure to meet the requirement for separation between
analytical and transactional processing. Analysis queries are agreed to operational data after the
middleware interprets them. In this way, queries affect transactional workloads.
Two-Tier Architecture
The requirement for separation plays an essential role in defining the two-tier architecture for a data
warehouse system, as shown in fig:
Although it is typically called two-layer architecture to highlight a separation between physically
available sources and data warehouses, in fact, consists of four subsequent data flow stages:
1. Source layer: A data warehouse system uses a heterogeneous source of data. That data is
stored initially to corporate relational databases or legacy databases, or it may come from an
information system outside the corporate walls.
1. Data Staging: The data stored to the source should be extracted, cleansed to remove
inconsistencies and fill gaps, and integrated to merge heterogeneous sources into one
standard schema. The so-named Extraction, Transformation, and Loading Tools (ETL) can
combine heterogeneous schemata, extract, transform, cleanse, validate, filter, and load source
data into a data warehouse.
2. Data Warehouse layer: Information is saved to one logically centralized individual repository: a
data warehouse. The data warehouses can be directly accessed, but it can also be used as a
source for creating data marts, which partially replicate data warehouse contents and are
designed for specific enterprise departments. Meta-data repositories store information on
sources, access procedures, data staging, users, data mart schema, and so on.
3. Analysis: In this layer, integrated data is efficiently, and flexible accessed to issue reports,
dynamically analyze information, and simulate hypothetical business scenarios. It should
feature aggregate information navigators, complex query optimizers, and customer-friendly
GUIs.
Three-Tier Architecture
The three-tier architecture consists of the source layer (containing multiple source system), the
reconciled layer and the data warehouse layer (containing both data warehouses and data marts).
The reconciled layer sits between the source data and data warehouse.
The main advantage of the reconciled layer is that it creates a standard reference data model for a
whole enterprise. At the same time, it separates the problems of source data extraction and
integration from those of data warehouse population. In some cases, the reconciled layer is also
directly used to accomplish better some operational tasks, such as producing daily reports that
cannot be satisfactorily prepared using the corporate applications or generating data flows to feed
external processes periodically to benefit from cleaning and integration.
This architecture is especially useful for the extensive, enterprise-wide systems. A disadvantage of
this structure is the extra file storage space used through the extra redundant reconciled layer. It also
makes the analytical tools a little further away from being real-time.
Metadata
Metadata is simply defined as data about data. The data that are used to represent other data is
known as metadata. For example, the index of a book serves as a metadata for the contents in the
book. In other words, we can say that metadata is the summarized data that leads us to the detailed
data.
In terms of data warehouse, we can define metadata as following −
● Metadata is a road-map to data warehouse.
● Metadata in data warehouse defines the warehouse objects.
● Metadata acts as a directory. This directory helps the decision support system to locate the
contents of a data warehouse.
Metadata Repository
Metadata repository is an integral part of a data warehouse system. It contains the following
metadata −
● Business metadata − It contains the data ownership information, business definition, and
changing policies.
● Operational metadata − It includes currency of data and data lineage. Currency of data refers
to the data being active, archived, or purged. Lineage of data means history of data migrated
and transformation applied on it.
● Data for mapping from operational environment to data warehouse − It metadata includes
source databases and their contents, data extraction, data partition, cleaning, transformation
rules, data refresh and purging rules.
● The algorithms for summarization − It includes dimension algorithms, data on granularity,
aggregation, summarizing, etc.
Data Cube
A data cube helps us represent data in multiple dimensions. It is defined by dimensions and facts.
The dimensions are the entities with respect to which an enterprise preserves the records.
Illustration of Data Cube
Suppose a company wants to keep track of sales records with the help of sales data warehouse with
respect to time, item, branch, and location. These dimensions allow to keep track of monthly sales
and at which branch the items were sold. There is a table associated with each dimension. This table
is known as dimension table. For example, "item" dimension table may have attributes such as
item_name, item_type, and item_brand.
The following table represents the 2-D view of Sales Data for a company with respect to time, item,
and location dimensions.
But here in this 2-D table, we have records with respect to time and item only. The sales for New
Delhi are shown with respect to time, and item dimensions according to type of items sold. If we
want to view the sales data with one more dimension, say, the location dimension, then the 3-D view
would be useful. The 3-D view of the sales data with respect to time, item, and location is shown in
the table below −
The above 3-D table can be represented as 3-D data cube as shown in the following figure −
Data Mart
Data marts contain a subset of organization-wide data that is valuable to specific groups of people in
an organization. In other words, a data mart contains only those data that is specific to a particular
group. For example, the marketing data mart may contain only data related to items, customers, and
sales. Data marts are confined to subjects.
Points to Remember About Data Marts
● Windows-based or Unix/Linux-based servers are used to implement data marts. They are
implemented on low-cost servers.
● The implementation cycle of a data mart is measured in short periods of time, i.e., in weeks
rather than months or years.
● The life cycle of data marts may be complex in the long run, if their planning and design are
not organization-wide.
● Data marts are small in size.
● Data marts are customized by department.
● The source of a data mart is departmentally structured data warehouse.
● Data marts are flexible.
The following figure shows a graphical representation of data marts.
OLTP System
OLTP System handle with operational data. Operational data are those data contained in the operation
of a particular system. Example, ATM transactions and Bank transactions, etc.
OLAP System
OLAP handle with Historical Data or Archival Data. Historical data are those data that are achieved
over a long period. For example, if we collect the last 10 years information about flight reservation, the
data can give us much meaningful data such as the trends in the reservation. This may provide useful
information like peak time of travel, what kind of people are traveling in various classes
(Economy/Business) etc.
The major difference between an OLTP and OLAP system is the amount of data analyzed in a single
transaction. Whereas an OLTP manage many concurrent customers and queries touching only an
individual record or limited groups of files at a time. An OLAP system must have the capability to
operate on millions of files to answer a single query.
Feature OLTP OLAP
Characteristic It is a system which is used to It is a system which is used to manage
manage operational Data. informational Data.
Users Clerks, clients, and information Knowledge workers, including managers,
technology professionals. executives, and analysts.
System OLTP system is a OLAP system is market-oriented, knowledge
orientation customer-oriented, transaction, workers including managers, do data analysts
and query processing are done executive and analysts.
by clerks, clients, and
information technology
professionals.
Data contents OLTP system manages current OLAP system manages a large amount of
data that too detailed and are historical data, provides facilitates for
used for decision making. summarization and aggregation, and stores
and manages data at different levels of
granularity. This information makes the data
more comfortable to use in informed decision
making.
Database Size 100 MB-GB 100 GB-TB
Database OLTP system usually uses an OLAP system typically uses either a star or
design entity-relationship (ER) data snowflake model and subject-oriented
model and application-oriented database design.
database design.
View OLTP system focuses primarily OLAP system often spans multiple versions of
on the current data within an a database schema, due to the evolutionary
enterprise or department, process of an organization. OLAP systems
without referring to historical also deal with data that originates from
information or data in different various organizations, integrating information
organizations. from many data stores.
Volume of data Not very large Because of their large volume, OLAP data are
stored on multiple storage media.
Access The access patterns of an OLTP Accesses to OLAP systems are mostly
patterns system subsist mainly of short, read-only methods because of these data
atomic transactions. Such a warehouses stores historical data.
system requires concurrency
control and recovery techniques.
Access mode Read/write Mostly write
Insert and Short and fast inserts and Periodic long-running batch jobs refresh the
Updates updates proposed by end-users. data.
Number of Tens Millions
records
accessed
Normalization Fully Normalized Partially Normalized
Processing Very Fast It depends on the amount of files contained,
Speed batch data refresh, and complex query may
take many hours, and query speed can be
upgraded by creating indexes.