0% found this document useful (0 votes)
14 views15 pages

Data Warehouse 9 Oct

Data warehousing, conceptualized in the late 1980s, is a relational database management system designed to support decision-making by providing a centralized repository of integrated historical data. It features characteristics such as being subject-oriented, integrated, non-volatile, and time-variant, and includes components like ETL tools, metadata, and query tools. Various architectures exist for data warehouses, including single-tier, two-tier, and three-tier structures, with data marts serving as subsets tailored for specific business areas.

Uploaded by

soumyasree
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
14 views15 pages

Data Warehouse 9 Oct

Data warehousing, conceptualized in the late 1980s, is a relational database management system designed to support decision-making by providing a centralized repository of integrated historical data. It features characteristics such as being subject-oriented, integrated, non-volatile, and time-variant, and includes components like ETL tools, metadata, and query tools. Various architectures exist for data warehouses, including single-tier, two-tier, and three-tier structures, with data marts serving as subsets tailored for specific business areas.

Uploaded by

soumyasree
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd
You are on page 1/ 15

DATA WAREHOUSE

The idea of data warehousing came to the late 1980's when IBM researchers Barry
Devlin and Paul Murphy established the "Business Data Warehouse."

In essence, the data warehousing idea was planned to support an architectural model for
the flow of information from the operational system to decisional support
environments. The concept attempt to address the various problems associated with
the flow, mainly the high costs associated with it.

Data Warehouse is a relational database management system (RDBMS) construct


to meet the requirement of transaction processing systems. It can be loosely
described as any centralized data repository which can be queried for business
benefits. It is a database that stores information oriented to satisfy decision-making
requests. So, Data Warehousing support architectures and tool for business executives
to systematically organize, understand and use their information to make strategic
decisions.

Data Warehouse environment contains an extraction, transportation, and loading


(ETL) solution, an online analytical processing (OLAP) engine, customer analysis
tools, and other applications that handle the process of gathering information and
delivering it to business users.

What is a Data Warehouse?


A Data Warehouse (DW) is a relational database that is designed for query and analysis
rather than transaction processing. It includes historical data derived from transaction
data from single and multiple sources.

A Data Warehouse provides integrated, enterprise-wide, historical data and focuses


on providing support for decision-makers for data modeling and analysis.

Data Warehouse is a subject-oriented, integrated, non-volatile and time-variant


store of information in support of management's decisions."

Characteristics of Data Warehouse


Subject-Oriented
A data warehouse target on the modeling and analysis of data for decision-makers.
Therefore, data warehouses typically provide a concise and straightforward view
around a particular subject, such as customer, product, or sales, instead of the global
organization's ongoing operations. This is done by excluding data that are not useful
concerning the subject and including all data needed by the users to understand the
subject.

Integrated
A data warehouse integrates various heterogeneous data sources like RDBMS, flat
files, and online transaction records. It requires performing data cleaning and integration
during data warehousing to ensure consistency in naming conventions, attributes types,
etc., among different data sources.
Non-Volatile
The data warehouse is a physically separate data storage, which is transformed
from the source operational RDBMS. Once the data is entered into the warehouse, it
should not be changed, ie, data within the warehouse should not be edited.

Time-Variant
Historical information is kept in a data warehouse. For example, one can retrieve files
from 3 months, 6 months, 12 months, or even previous data from a data warehouse.
These variations with a transactions system, where often only the most current file is
kept.

Database Data Warehouse

1. It is used for Online Transactional 1. It is used for Online Analytical


Processing (OLTP) but can be used for Processing (OLAP).
other objectives such as Data
Warehousing.
2. The tables and joins are complicated 2. The tables and joins are accessible
since they are normalized for RDBMS. This since they are de-normalized. This is
is done to reduce redundant files and to done to minimize the response time for
save storage space. analytical queries.

3. Data is dynamic 3. Data is largely static

4. Entity: Relational modeling procedures 4. Data: Modeling approach are used


are used for RDBMS database design. for the Data Warehouse design.

5. Optimized for write operations. 5. Optimized for read operations.

6. Performance is low for analysis queries. 6. High performance for analytical


queries.

7. The database is the place where the 7. Data Warehouse is the place where
data is taken as a base and managed to the application data is handled for
get available fast and efficient access. analysis and reporting objectives.

Architecture of a Data Warehouse


Three common architectures are:

o Data Warehouse Architecture: Basic


o Data Warehouse Architecture: With Staging Area
o Data Warehouse Architecture: With Staging Area and Data Marts

Data Warehouse Architecture: Basic


 An operational system is a method used in data warehousing to refer to
a system that is used to process the day-to-day transactions of an organization.

 Flat Files: A Flat file system is a system of files in which transactional data is
stored, and every file in the system must have a different name.

 Meta Data: A set of data that defines and gives information about other data.

 Lightly and highly summarized data : The area of the data warehouse saves all
the predefined lightly and highly summarized (aggregated) data generated by the
warehouse manager. The goals of the summarized information are to speed
up query performance. The summarized record is updated continuously as new
information is loaded into the warehouse.

 End-User access Tools: The principal purpose of a data warehouse is to provide


information to the business managers for strategic decision-making. These
customers interact with the warehouse using end-client access tools.
Data
Warehouse Architecture: With Staging Area
We must clean and process your operational information before put it into the
warehouse. We can do this programmatically, although data warehouses uses a staging
area (A place where data is processed before entering the warehouse).

A staging area simplifies data cleansing and consolidation for operational method
coming from multiple source systems, especially for enterprise data warehouses where
all relevant data of an enterprise is consolidated.

Data Warehouse Staging Area is a temporary location where a record from source
systems is copied.
Data Warehouse Architecture: With Staging Area and Data Marts
We may want to customize our warehouse's architecture for multiple groups within our
organization.

We can do this by adding data marts. A data mart is a segment of a data warehouses
that can provided information for reporting and analysis on a section, unit, department
or operation in the company, e.g., sales, payroll, production, etc.

The figure illustrates an example where purchasing, sales, and stocks are separated. In
this example, a financial analyst wants to analyze historical data for purchases and sales
or mine historical information to make predictions about customer behavior.
Types of Data Warehouse Architectures

Single-Tier Architecture
Single-Tier architecture is not periodically used in practice. Its purpose is to minimize the
amount of data stored to reach this goal; it removes data redundancies. Not effective for
organizations with large data needs and multiple streams.

Two-Tier Architecture
The requirement for separation plays an essential role in defining the two-tier
architecture for a data warehouse system. While it is more effective at storing and
sorting the data, it is not scalable and supports minimum number of end users.

Three-Tier Architecture
The three-tier architecture consists of the source layer (containing multiple source
system), the reconciled layer and the data warehouse layer (containing both data
warehouses and data marts). The reconciled layer sits between the source data and data
warehouse.

The main advantage of the reconciled layer is that it creates a standard reference data
model for a whole enterprise. At the same time, it separates the problems of source data
extraction and integration from those of data warehouse population.

This architecture is especially useful for the extensive, enterprise-wide systems. A


disadvantage of this structure is the extra file storage space used through the extra
redundant reconciled layer. It also makes the analytical tools a little further away from
being real-time.

Data Warehouses usually have a three-level (tier) architecture that includes:

1. Bottom Tier (Data Warehouse Server)


2. Middle Tier (OLAP Server)
3. Top Tier (Front end Tools).

1. A bottom-tier that consists of the Data Warehouse server, which is almost


always an RDBMS. It may include several specialized data marts and a metadata
repository.

Data from operational databases and external sources (such as user profile data
provided by external consultants) are extracted using application program
interfaces called a gateway. A gateway is provided by the underlying DBMS and
allows customer programs to generate SQL code to be executed at a server.

Examples of gateways contain ODBC (Open Database Connection) and OLE-


DB (Open-Linking and Embedding for Databases), by Microsoft, and JDBC (Java
Database Connection).

2. A middle-tier which consists of an OLAP server for fast querying of the data
warehouse.

The OLAP server is implemented using either

 A Relational OLAP (ROLAP) model, i.e., an extended relational DBMS that


maps functions on multidimensional data to standard relational operations.

 A Multidimensional OLAP (MOLAP) model, i.e., a particular purpose server


that directly implements multidimensional information and operations.

3. A top-tier that contains front-end tools for displaying results provided by OLAP,
as well as additional tools for data mining of the OLAP-generated data.
The metadata repository stores information that defines DW objects. It includes the
following parameters and information for the middle and the top-tier applications:

1. A description of the DW structure, including the warehouse schema, dimension,


hierarchies, data mart locations, and contents, etc.
2. Operational metadata, which usually describes the currency level of the stored
data, i.e., active, archived or purged, and warehouse monitoring information, i.e.,
usage statistics, error reports, audit, etc.
3. System performance data, which includes indices, used to improve data access
and retrieval performance.
4. Information about the mapping from operational databases, which provides
source RDBMSs and their contents, cleaning and transformation rules, etc.
5. Summarization algorithms, predefined queries, and reports business data, which
include business terms and definitions, ownership information, etc.

Datawarehouse Components
We will learn about the Datawarehouse Components and Architecture of Data
Warehouse with Diagram as shown below:

Data Warehouse Architecture

The Data Warehouse is based on an RDBMS server which is a central information


repository that is surrounded by some key Data Warehousing components to make
the entire environment functional, manageable and accessible.

There are mainly five Data Warehouse Components:


1. Data Warehouse Database
The central database is the foundation of the data warehousing environment. This
database is implemented on the RDBMS technology. Although, this kind of
implementation is constrained by the fact that traditional RDBMS system is optimized
for transactional database processing and not for data warehousing.

For instance, ad-hoc query, multi-table joins, aggregates are resource intensive and
slow down performance.

Hence, alternative approaches to Database are used as listed below-

 In a datawarehouse, relational databases are deployed in parallel to allow for


scalability. Parallel relational databases also allow shared memory on various
multiprocessor configurations or massively parallel processors.
 Use of multidimensional database (MDDBs) to overcome any limitations which
are placed because of the relational Data Warehouse Models.

2. Sourcing, Acquisition, Clean-up and Transformation Tools


(ETL)
The data sourcing, transformation, and migration tools are used for performing all the
conversions, summarizations, and all the changes needed to transform data into a
unified format in the data warehouse. They are also called Extract, Transform and
Load (ETL) Tools.

Their functionality includes:

 Eliminating unwanted data in operational databases from loading into Data


warehouse.
 Search and replace common names and definitions for data arriving from
different sources.
 In case of missing data, populate them with defaults.
 De-duplicated repeated data arriving from multiple data sources.

These Extract, Transform, and Load tools may generate cron jobs, background jobs,
Cobol programs, shell scripts, etc. that regularly update data in data warehouse.
These tools are also helpful to maintain the Metadata.

These ETL Tools have to deal with challenges of Database & Data heterogeneity.

3. Metadata
Metadata is data about data which defines the data warehouse. It is used for
building, maintaining and managing the data warehouse.
In the Data Warehouse Architecture, meta-data plays an important role as it specifies
the source, usage, values, and features of data warehouse data. It also defines how
data can be changed and processed. It is closely connected to the data warehouse.

For example, a line in sales database may contain:


4030 KJ732 299.90
This is a meaningless data until we consult the Meta that tell us it was

 Model number: 4030


 Sales Agent ID: KJ732
 Total sales amount of $299.90

Therefore, Meta Data are essential ingredients in the transformation of data into
knowledge.

Metadata helps to answer the following questions

 What tables, attributes, and keys does the Data Warehouse contain?
 Where did the data come from?
 How many times do data get reloaded?
 What transformations were applied with cleansing?

Metadata can be classified into following categories:

1. Technical Meta Data: This kind of Metadata contains information about


warehouse which is used by Data warehouse designers and administrators.
2. Business Meta Data: This kind of Metadata contains detail that gives end-users a
way easy to understand information stored in the data warehouse.

4. Query Tools
One of the primary objects of data warehousing is to provide information to
businesses to make strategic decisions. Query tools allow users to interact with the
data warehouse system.

These tools fall into four different categories:

1. Query and reporting tools


2. Application Development tools
3. Data mining tools
4. OLAP tools
1.Query and reporting tools: Query and reporting tools can be further divided into
 Reporting tools
 Managed query tools

Reporting tools: can be further divided into production reporting tools and
desktop report writer.

1. Report writers: This kind of reporting tool are tools designed for end-users for
their analysis.
2. Production reporting: This kind of tools allows organizations to generate regular
operational reports. It also supports high volume batch jobs like printing and
calculating. Some popular reporting tools are Brio, Business Objects, Oracle,
PowerSoft, SAS Institute.

Managed query tools: This kind of access tools helps end users to resolve
problems in database and SQL and database structure by inserting meta-layer
between users and database.

2.Application development tools: Sometimes built-in graphical and analytical tools do


not satisfy the analytical needs of an organization. In such cases, custom reports are
developed using Application development tools.

3. Data mining tools: Data mining is a process of discovering meaningful new


correlation, pattens, and trends by mining large amount data. Data mining tools are
used to make this process automatic.
4. OLAP tools: These tools are based on concepts of a multidimensional database. It
allows users to analyse the data using elaborate and complex multidimensional views.
5. Data Marts
A data mart is an access layer which is used to get data out to the users. It is presented
as an option for large size data warehouse as it takes less time and money to build.
However, there is no standard definition of a data mart is differing from person to
person.

In a simple word Data mart is a subsidiary of a data warehouse. The data mart is used
for partition of data which is created for the specific group of users. Data marts could
be created in the same database as the Datawarehouse or a physically separate
Database.
Benefits of Data Warehousing:

What Is a Data Mart?

A data mart is a subset of a database —usually a data warehouse— where data is stored
for a specific business area. That is, a data mart stores concise and specific data sets used
for analysis for a specific department or line of business, such as the sales department.
The data mart is query-oriented and, as in a data warehouse, the data
has a clear structure —also typically star or snowflake dimensional models—. The intent of
using a data mart is to index data and facilitate queries about specific areas of the business
and to meet the needs of a particular group of users within the organization.
What is a data lake?

A data lake is a central storage repository that holds big data from many sources in a raw,
granular format. It can store structured, semi-structured, or unstructured data, which
means data can be kept in a more flexible format for future use. When storing data, a data
lake associates it with identifiers and metadata tags for faster retrieval.

A data lake works on a principle called schema-on-read. This means that there is no
predefined schema into which data needs to be fitted before storage. Only when the data is
read during processing is it parsed and adapted into a schema as needed. This feature
saves a lot of time that’s usually spent on defining a schema. This also enables data to be
stored as is, in any format.

https://youtu.be/GHpcLEkkmLc

You might also like