
DWDM

Unit-1

What Is a Data Warehouse?

A data warehouse is a centralized repository for storing, analyzing, and interpreting information in order to facilitate better-informed decision-making. An organization's data warehouse receives data from a variety of sources on a regular basis, typically including transactional systems, relational databases, and other sources. A data warehouse is thus a type of data management system that facilitates and supports business intelligence (BI) activities, specifically analysis. Data warehouses are primarily designed to facilitate searches and analyses, and they usually contain large amounts of historical data.

A data warehouse can also be defined as a collection of organizational data and information extracted from operational sources and external data sources. The data is periodically pulled from various internal applications like sales, marketing, and finance; customer-interface applications; as well as external partner systems. This data is then made available for decision-makers to access and analyze. In short, a data warehouse is a comprehensive repository of current and historical information that is designed to enhance an organization's performance.

Key Characteristics of Data Warehouse

The main characteristics of a data warehouse are as follows:

• Subject-Oriented

A data warehouse is subject-oriented since it provides topic-wise information rather than the
overall processes of a business. Such subjects may be sales, promotion, inventory, etc. For
example, if you want to analyze your company’s sales data, you need to build a data warehouse
that concentrates on sales. Such a warehouse would provide valuable information like ‘who was
your best customer last year?’ or ‘who is likely to be your best customer in the coming year?’
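As a minimal sketch of how such a subject-oriented question maps onto the warehouse (the table layout, column names, and figures below are illustrative assumptions, not a prescribed design), the "best customer last year" question becomes a simple aggregate over a sales subject area:

import sqlite3

# A tiny, hypothetical sales subject area (names and figures are made up)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (customer_id TEXT, sale_date TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    ("C1", "2023-03-10", 500.0),
    ("C2", "2023-07-22", 900.0),
    ("C1", "2023-11-05", 650.0),
])

# 'Who was your best customer last year?' as a subject-oriented aggregate
best = conn.execute("""
    SELECT customer_id, SUM(amount) AS total
    FROM sales
    WHERE sale_date BETWEEN '2023-01-01' AND '2023-12-31'
    GROUP BY customer_id
    ORDER BY total DESC
    LIMIT 1
""").fetchone()
print(best)  # ('C1', 1150.0)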

• Integrated
A data warehouse is developed by integrating data from varied sources into a consistent format.
The data must be stored in the warehouse in a consistent and universally acceptable manner in
terms of naming, format, and coding. This facilitates effective data analysis.

• Non-Volatile

Data once entered into a data warehouse must remain unchanged. All data is read-only. Previous
data is not erased when current data is entered. This helps you to analyze what has happened and
when.

• Time-Variant

The data stored in a data warehouse is documented with an element of time, either explicitly or implicitly. An example of time variance in a data warehouse is the primary key, which must contain an element of time such as the day, week, or month.

Database vs. Data Warehouse

Although a data warehouse and a traditional database share some similarities, they are not the same thing. The main difference is that in a database, data is collected for multiple transactional purposes, whereas in a data warehouse, data is collected on an extensive scale to perform analytics. Databases provide real-time data, while warehouses store data to be accessed by big analytical queries.

A data warehouse is an example of an OLAP system, i.e., an online database query-answering system. OLTP, by contrast, is an online database-modifying system; an ATM is a typical example.

Data Warehouse Architecture

Usually, data warehouse architecture comprises a three-tier structure.

Bottom Tier

The bottom tier or data warehouse server usually represents a relational database system. Back-
end tools are used to cleanse, transform and feed data into this layer.
Middle Tier

The middle tier represents an OLAP server that can be implemented in two ways.

The ROLAP (Relational OLAP) model is an extended relational database management system that maps operations on multidimensional data to standard relational operations.

The MOLAP (Multidimensional OLAP) model directly operates on multidimensional data and operations.

Top Tier

This is the front-end client interface that gets data out from the data warehouse. It holds various
tools like query tools, analysis tools, reporting tools, and data mining tools.

How Data Warehouse Works

Data Warehousing integrates data and information collected from various sources into one
comprehensive database. For example, a data warehouse might combine customer information
from an organization’s point-of-sale systems, its mailing lists, website, and comment cards. It
might also incorporate confidential information about employees, salary information, etc. Businesses use these components of a data warehouse to analyze their customers.

Data mining is one of the features of a data warehouse that involves looking for meaningful data
patterns in vast volumes of data and devising innovative strategies for increased sales and
profits.

Types of Data Warehouse

There are three main types of data warehouse.

Enterprise Data Warehouse (EDW)

This type of warehouse serves as a key or central database that facilitates decision-support
services throughout the enterprise. The advantage to this type of warehouse is that it provides
access to cross-organizational information, offers a unified approach to data representation, and
allows running complex queries.
Operational Data Store (ODS)

This type of data warehouse refreshes in real-time. It is often preferred for routine activities like storing employee records, and it is used when data warehouse systems do not support the reporting needs of the business.

Data Mart

A data mart is a subset of a data warehouse built to serve a particular department, region, or business unit. Every department of a business can have its own data mart to store department-specific data. Data from the data mart is periodically stored in the ODS; the ODS then sends the data to the EDW, where it is stored and used.

Data Warehouse Example

Let us look at some examples of how companies use data warehouses as an integral part of their day-to-day operations.

Investment and insurance companies use data warehouses primarily to analyze customer and market trends and related data patterns. In sub-sectors like forex and stock markets, the data warehouse plays a significant role because a single-point difference can result in huge losses across the board.

Retail chains use data warehouses for marketing and distribution, so they can track items,
examine pricing policies and analyze buying trends of customers. They use data warehouse
models for business intelligence and forecasting needs.

Healthcare companies, on the other hand, use data warehouse concepts to generate treatment
reports, share data with insurance companies and in research and medical units. Healthcare
systems depend heavily upon enterprise data warehouses because they need the latest, updated
treatment information to save lives.

Data Warehousing Tools

Wondering what data warehouse tools are? These are software components used to perform several operations on an extensive data set. They help to collect, read, write, and transfer data from various sources, and they are designed to support operations like data sorting, filtering, merging, etc.
Data warehouse applications can be categorized as:

• Query and reporting tools

• Application development tools

• Data mining tools

• OLAP tools

Some popular data warehouse tools are Xplenty, Amazon Redshift, Teradata, Oracle 12c,
Informatica, IBM Infosphere, Cloudera, and Panoply.

Benefits of Data Warehouse

Wondering why businesses need data warehousing? There are several benefits of a data warehouse for end users.

• Improved data consistency

• Better business decisions

• Easier access to enterprise data for end users

• Better documentation of data

• Reduced computing costs and higher productivity

• The ability for end users to run ad-hoc queries and reports without degrading the performance of operational systems

• Collection of related data from various sources into one place

Companies with dedicated data warehouse teams emerge ahead of others in key areas of product development, pricing, marketing, production time, historical analysis, forecasting, and customer satisfaction. Though data warehouses can be expensive, they pay off in the long run.

Components or Building Blocks of Data Warehouse

Architecture is the proper arrangement of the elements. We build a data warehouse with software and hardware components. To suit the requirements of our organization, we arrange these building blocks in a certain way; we may also want to strengthen one part with extra tools and services. All of this depends on our circumstances.
The figure shows the essential elements of a typical warehouse. The Source Data component appears on the left. The Data Staging element serves as the next building block. In the middle is the Data Storage component that manages the data warehouse's data; this element not only stores and manages the data, it also keeps track of the data using the metadata repository. The Information Delivery component, shown on the right, consists of all the different ways of making the information from the data warehouse available to the users.

Source Data Component

Source data coming into the data warehouses may be grouped into four broad categories:

Production Data: This type of data comes from the different operational systems of the enterprise. Based on the data requirements of the data warehouse, we choose segments of the data from the various operational systems.

Internal Data: In each organization, the client keeps their "private" spreadsheets, reports,
customer profiles, and sometimes even department databases. This is the internal data, part of
which could be useful in a data warehouse.

Archived Data: Operational systems are mainly intended to run the current business. In every operational system, we periodically take the old data and store it in archived files.
External Data: Most executives depend on information from external sources for a large percentage of the information they use. They use statistics relating to their industry produced by external agencies.

Data Staging Component

After we have extracted data from various operational systems and external sources, we have to prepare the files for storage in the data warehouse. The extracted data coming from several different sources needs to be changed, converted, and made ready in a format that is suitable for querying and analysis.

We will now discuss the three primary functions that take place in the staging area.

1) Data Extraction: This method has to deal with numerous data sources. We have to employ
the appropriate techniques for each data source.

2) Data Transformation: As we know, data for a data warehouse comes from many different sources. If data extraction for a data warehouse poses big challenges, data transformation presents even more significant challenges. We perform several individual tasks as part of data transformation.

First, we clean the data extracted from each source. Cleaning may be the correction of misspellings, may deal with providing default values for missing data elements, or may be the elimination of duplicates when we bring in the same data from various source systems.

Standardization of data elements forms a large part of data transformation. Data transformation also involves many forms of combining pieces of data from different sources: we combine data from a single source record, or related data parts from many source records. On the other hand, data transformation also includes purging source data that is not useful and separating out source records into new combinations. Sorting and merging of data take place on a large scale in the data staging area. When the data transformation function ends, we have a collection of integrated data that is cleaned, standardized, and summarized.

3) Data Loading: Two distinct categories of tasks form the data loading function. When we complete the structure and construction of the data warehouse and go live for the first time, we do the initial loading of the information into the data warehouse storage. The initial load moves high volumes of data and uses up a substantial amount of time.

Data Storage Components

Data storage for the data warehousing is a split repository. The data repositories for the
operational systems generally include only the current data. Also, these data repositories include
the data structured in highly normalized for fast and efficient processing.

Information Delivery Component

The information delivery element is used to enable the process of subscribing to data warehouse files and having them transferred to one or more destinations according to some customer-specified scheduling algorithm.

Metadata Component
Metadata in a data warehouse is similar to the data dictionary or the data catalog in a database management system. In the data dictionary, we keep the data about the logical data structures, the data about the records and addresses, the information about the indexes, and so on.

Data Marts

A data mart includes a subset of corporate-wide data that is of value to a specific group of users. Its scope is confined to particular selected subjects. Data in a data warehouse should be fairly current, but not necessarily up to the minute, although developments in the data warehouse industry have made standard and incremental data loads more achievable. Data marts are smaller than data warehouses and usually contain data for a single department or business unit. The current trend in data warehousing is to develop a data warehouse with several smaller related data marts for particular kinds of queries and reports.

Management and Control Component

The management and control elements coordinate the services and functions within the data warehouse. These components control the data transformation and the data transfer into the data warehouse storage. On the other hand, they moderate the data delivery to the clients. They work with the database management systems and allow data to be correctly stored in the repositories. They monitor the movement of data into the staging area and from there into the data warehouse storage itself.

Why we need a separate Data Warehouse?

Data warehouse queries are complex because they involve the computation of large groups of data at summarized levels.

They may require the use of distinctive data organization, access, and implementation methods based on multidimensional views.

Performing OLAP queries in an operational database degrades the performance of functional tasks.


A data warehouse is used for analysis and decision making, which requires an extensive database including historical data that an operational database does not typically maintain.

The separation of the operational database from the data warehouse is based on the different structures and uses of data in these systems.

Because the two systems provide different functionalities and require different kinds of data, it is necessary to maintain separate databases.

Difference between Database and Data Warehouse

1. Database: It is used for Online Transaction Processing (OLTP), though it can also be used for other purposes such as data warehousing; it records current data from clients. Data Warehouse: It is used for Online Analytical Processing (OLAP); it reads historical information about customers for business decisions.

2. Database: The tables and joins are complicated since they are normalized for the RDBMS; this is done to reduce redundant data and save storage space. Data Warehouse: The tables and joins are simple since they are de-normalized; this is done to minimize the response time for analytical queries.

3. Database: Data is dynamic. Data Warehouse: Data is largely static.

4. Database: Entity-Relationship modeling techniques are used for database design. Data Warehouse: Data modeling techniques are used for warehouse design.

5. Database: Optimized for write operations. Data Warehouse: Optimized for read operations.

6. Database: Performance is low for analysis queries. Data Warehouse: High performance for analytical queries.

7. Database: The database is the place where data is captured and managed for fast and efficient access. Data Warehouse: The data warehouse is the place where application data is handled for analysis and reporting purposes.

Difference between Operational Database and Data Warehouse

The operational database is the source of information for the data warehouse. It includes detailed information used to run the day-to-day operations of the business. The data changes frequently as updates are made, and it reflects the current value of the last transactions.

Operational database management systems, also called OLTP (Online Transaction Processing) databases, are used to manage dynamic data in real-time.

Data warehouse systems serve users or knowledge workers for the purpose of data analysis and decision-making. Such systems can organize and present information in specific formats to accommodate the diverse needs of various users. These systems are called Online Analytical Processing (OLAP) systems.

Data Warehouse and the OLTP database are both relational databases. However, the goals of
both these databases are different.

1. Operational systems are designed to support high-volume transaction processing. Data warehousing systems are typically designed to support high-volume analytical processing (i.e., OLAP).

2. Operational systems are usually concerned with current data. Data warehousing systems are usually concerned with historical data.

3. Data within operational systems is updated regularly according to need. Data warehouse data is non-volatile: new data may be added regularly, but once added it is rarely changed.

4. Operational systems are designed for real-time business dealings and processes. Data warehousing systems are designed for analysis of business measures by subject area, categories, and attributes.

5. Operational systems are optimized for a simple set of transactions, generally adding or retrieving a single row at a time per table. Data warehousing systems are optimized for bulk loads and large, complex, unpredictable queries that access many rows per table.

6. Operational systems are optimized for validation of incoming information during transactions and use validation data tables. Data warehousing systems are loaded with consistent, valid information and require no real-time validation.

7. Operational systems support thousands of concurrent clients. Data warehousing systems support a few concurrent clients relative to OLTP.

8. Operational systems are widely process-oriented. Data warehousing systems are widely subject-oriented.

9. Operational systems are usually optimized to perform fast inserts and updates of relatively small volumes of data. Data warehousing systems are usually optimized to perform fast retrievals of relatively large volumes of data.

10. Operational systems are about getting data in; data warehousing systems are about getting data out. A small number of records is accessed per operation in an operational system; a large number of records is accessed per query in a data warehouse.

11. Relational databases are created for Online Transaction Processing (OLTP). Data warehouses are designed for Online Analytical Processing (OLAP).

Difference between OLTP and OLAP


OLTP System

OLTP System handle with operational data. Operational data are those data contained in the
operation of a particular system. Example, ATM transactions and Bank transactions, etc.

OLAP System

An OLAP system handles historical or archival data. Historical data is data that is archived over a long period. For example, if we collect the last 10 years of information about flight reservations, the data can give us meaningful insights such as trends in reservations. This may provide useful information like the peak time of travel and what kind of people are traveling in various classes (Economy/Business), etc.

The major difference between an OLTP and an OLAP system is the amount of data analyzed in a single transaction. An OLTP system handles many concurrent customers and queries touching only an individual record or limited groups of records at a time, whereas an OLAP system must have the capability to operate on millions of records to answer a single query.

Characteristic: An OLTP system is used to manage operational data. An OLAP system is used to manage informational data.

Users: OLTP serves clerks, clients, and information technology professionals. OLAP serves knowledge workers, including managers, executives, and analysts.

System orientation: An OLTP system is customer-oriented; transaction and query processing are done by clerks, clients, and information technology professionals. An OLAP system is market-oriented; data analysis is done by knowledge workers, including managers, executives, and analysts.

Data contents: An OLTP system manages current data that, typically, is too detailed to be easily used for decision making. An OLAP system manages large amounts of historical data, provides facilities for summarization and aggregation, and stores and manages data at different levels of granularity; this makes the data easier to use for informed decision making.

Database size: OLTP: 100 MB to GB. OLAP: 100 GB to TB.

Database design: An OLTP system usually uses an entity-relationship (ER) data model and an application-oriented database design. An OLAP system typically uses either a star or a snowflake model and a subject-oriented database design.

View: An OLTP system focuses primarily on the current data within an enterprise or department, without referring to historical data or data in different organizations. An OLAP system often spans multiple versions of a database schema, due to the evolutionary process of an organization; OLAP systems also deal with data that originates from various organizations, integrating information from many data stores.

Volume of data: OLTP data volumes are not very large. Because of their large volume, OLAP data are stored on multiple storage media.

Access patterns: The access patterns of an OLTP system consist mainly of short, atomic transactions; such a system requires concurrency control and recovery techniques. Accesses to OLAP systems are mostly read-only operations, because these data warehouses store historical data.

Access mode: OLTP: read/write. OLAP: mostly read.

Inserts and updates: OLTP: short and fast inserts and updates initiated by end users. OLAP: periodic long-running batch jobs refresh the data.

Number of records accessed: OLTP: tens. OLAP: millions.

Normalization: OLTP: fully normalized. OLAP: partially normalized.

Processing speed: OLTP: very fast. OLAP: depends on the amount of data involved; batch data refreshes and complex queries may take many hours, and query speed can be improved by creating indexes.

Data Warehouse Architecture

A data warehouse architecture is a method of defining the overall architecture of data communication, processing, and presentation that exists for end-client computing within the enterprise. Each data warehouse is different, but all are characterized by standard vital components.

Production applications such as payroll, accounts payable, product purchasing, and inventory control are designed for online transaction processing (OLTP). Such applications gather detailed data from day-to-day operations.

Data warehouse applications are designed to support the user's ad-hoc data requirements, an activity recently dubbed online analytical processing (OLAP). These include applications such as forecasting, profiling, summary reporting, and trend analysis.

Production databases are updated continuously, either by hand or via OLTP applications. In contrast, a warehouse database is updated from operational systems periodically, usually during off-hours. As OLTP data accumulates in production databases, it is regularly extracted, filtered, and then loaded into a dedicated warehouse server that is accessible to users. As the warehouse is populated, it must be restructured: tables are de-normalized, data is cleansed of errors and redundancies, and new fields and keys are added to reflect the users' needs for sorting, combining, and summarizing data.

Data warehouses and their architectures vary depending upon the elements of an organization's situation.

Three common architectures are:

o Data Warehouse Architecture: Basic


o Data Warehouse Architecture: With Staging Area
o Data Warehouse Architecture: With Staging Area and Data Marts

Data Warehouse Architecture: Basic

Operational System
An operational system is a term used in data warehousing to refer to a system that processes the day-to-day transactions of an organization.

Flat Files

A Flat file system is a system of files in which transactional data is stored, and every file in the
system must have a different name.

Meta Data

A set of data that defines and gives information about other data.

Metadata is used in a data warehouse for a variety of purposes, including:

Metadata summarizes necessary information about data, which can make finding and working with particular instances of data easier. For example, author, date created, date modified, and file size are examples of very basic document metadata.

Metadata is used to direct a query to the most appropriate data source.

Lightly and highly summarized data

This area of the data warehouse stores all the predefined lightly and highly summarized (aggregated) data generated by the warehouse manager.

The goal of the summarized information is to speed up query performance. The summarized data is updated continuously as new information is loaded into the warehouse.

End-User access Tools

The principal purpose of a data warehouse is to provide information to business managers for strategic decision-making. These users interact with the warehouse using end-client access tools.

The examples of some of the end-user access tools can be:

o Reporting and Query Tools


o Application Development Tools
o Executive Information Systems Tools
o Online Analytical Processing Tools
o Data Mining Tools

Data Warehouse Architecture: With Staging Area


We must clean and process our operational data before putting it into the warehouse.

We can do this programmatically, although most data warehouses use a staging area instead (a place where data is processed before entering the warehouse).

A staging area simplifies data cleansing and consolidation for operational data coming from multiple source systems, especially for enterprise data warehouses where all relevant data of an enterprise is consolidated.

The data warehouse staging area is thus a temporary location where records from source systems are copied.

Data Warehouse Architecture: With Staging Area and Data Marts

We may want to customize our warehouse's architecture for multiple groups within our
organization.

We can do this by adding data marts. A data mart is a segment of a data warehouse that provides information for reporting and analysis on a section, unit, department, or operation in the company, e.g., sales, payroll, production, etc.

The figure illustrates an example where purchasing, sales, and stocks are separated. In this
example, a financial analyst wants to analyze historical data for purchases and sales or mine
historical information to make predictions about customer behavior.

Properties of Data Warehouse Architectures

The following architecture properties are necessary for a data warehouse system:

1. Separation: Analytical and transactional processing should be kept apart as much as possible.
2. Scalability: Hardware and software architectures should be easy to upgrade as the data volume that has to be managed and processed, and the number of users' requirements that have to be met, progressively increase.

3. Extensibility: The architecture should be able to accommodate new operations and technologies without redesigning the whole system.

4. Security: Monitoring accesses is necessary because of the strategic data stored in the data warehouse.

5. Administerability: Data Warehouse management should not be complicated.

Types of Data Warehouse Architectures

Single-Tier Architecture

Single-tier architecture is not frequently used in practice. Its purpose is to minimize the amount of data stored; to reach this goal, it removes data redundancies.

The figure shows that the only layer physically available is the source layer. In this method, the data warehouse is virtual: it is implemented as a multidimensional view of operational data created by specific middleware, or an intermediate processing layer.

The vulnerability of this architecture lies in its failure to meet the requirement for separation between analytical and transactional processing. Analysis queries are directed to operational data after the middleware interprets them; in this way, queries affect transactional workloads.

Two-Tier Architecture

The requirement for separation plays an essential role in defining the two-tier architecture for a
data warehouse system, as shown in fig:

Although it is typically called a two-layer architecture to highlight the separation between physically available sources and the data warehouse, it in fact consists of four subsequent data flow stages:
1. Source layer: A data warehouse system uses heterogeneous sources of data. That data is stored initially in corporate relational databases or legacy databases, or it may come from information systems outside the corporate walls.
2. Data Staging: The data stored by the sources should be extracted, cleansed to remove inconsistencies and fill gaps, and integrated to merge heterogeneous sources into one standard schema. The so-called Extraction, Transformation, and Loading (ETL) tools can combine heterogeneous schemata; extract, transform, cleanse, validate, and filter data; and load source data into the data warehouse.
3. Data Warehouse layer: Information is saved in one logically centralized repository: the data warehouse. The data warehouse can be accessed directly, but it can also be used as a source for creating data marts, which partially replicate data warehouse contents and are designed for specific enterprise departments. Meta-data repositories store information on sources, access procedures, data staging, users, data mart schemas, and so on.
4. Analysis: In this layer, integrated data is efficiently and flexibly accessed to issue reports, dynamically analyze information, and simulate hypothetical business scenarios. It should feature aggregate information navigators, complex query optimizers, and customer-friendly GUIs.

Three-Tier Architecture

The three-tier architecture consists of the source layer (containing multiple source systems), the reconciled layer, and the data warehouse layer (containing both data warehouses and data marts). The reconciled layer sits between the source data and the data warehouse.

The main advantage of the reconciled layer is that it creates a standard reference data model for the whole enterprise. At the same time, it separates the problems of source data extraction and integration from those of data warehouse population. In some cases, the reconciled layer is also used directly to better accomplish some operational tasks, such as producing daily reports that cannot be satisfactorily prepared using the corporate applications, or generating data flows to feed external processes periodically so as to benefit from cleaning and integration.

This architecture is especially useful for extensive, enterprise-wide systems. A disadvantage of this structure is the extra storage space used by the redundant reconciled layer; it also moves the analytical tools a little further away from being real-time.

The middle tier consists of an OLAP server for fast querying of the data warehouse.

The OLAP server is implemented using either

(1) A Relational OLAP (ROLAP) model, i.e., an extended relational DBMS that maps operations on multidimensional data to standard relational operations.

(2) A Multidimensional OLAP (MOLAP) model, i.e., a special-purpose server that directly implements multidimensional data and operations.

A top-tier that contains front-end tools for displaying results provided by OLAP, as well as
additional tools for data mining of the OLAP-generated data.

The overall Data Warehouse Architecture is shown in fig:


The metadata repository stores information that defines DW objects. It includes the following
parameters and information for the middle and the top-tier applications:

1. A description of the DW structure, including the warehouse schema, dimensions, hierarchies, data mart locations and contents, etc.
2. Operational metadata, which usually describes the currency level of the stored data, i.e.,
active, archived or purged, and warehouse monitoring information, i.e., usage statistics,
error reports, audit, etc.
3. System performance data, which includes indices, used to improve data access and
retrieval performance.
4. Information about the mapping from operational databases, which provides
source RDBMSs and their contents, cleaning and transformation rules, etc.
5. Summarization algorithms, predefined queries and reports, and business data, which include business terms and definitions, data ownership information, etc.

Principles of Data Warehousing

Load Performance

Data warehouses require incremental loading of new data on a periodic basis within narrow time windows; performance on the load process should be measured in hundreds of millions of rows and gigabytes per hour, and must not artificially constrain the volume of data the business requires.

Load Processing

Many steps must be taken to load new or updated data into the data warehouse, including data conversion, filtering, reformatting, indexing, and metadata updates.

Data Quality Management

Fact-based management demands the highest data quality. The warehouse must ensure local consistency, global consistency, and referential integrity despite "dirty" sources and massive database size.
Query Performance

Fact-based management must not be slowed by the performance of the data warehouse RDBMS; large, complex queries must complete in seconds, not days.

Terabyte Scalability

Data warehouse sizes are growing at astonishing rates. Today they range from a few gigabytes to hundreds of gigabytes, and terabyte-sized data warehouses are increasingly common.

ETL (Extract, Transform, and Load) Process

What is ETL?

The mechanism of extracting information from source systems and bringing it into the data
warehouse is commonly called ETL, which stands for Extraction, Transformation and
Loading.

The ETL process requires active inputs from various stakeholders, including developers, analysts, testers, and top executives, and it is technically challenging.

To maintain its value as a tool for decision-makers, a data warehouse needs to change as the business changes. ETL is a recurring activity (daily, weekly, monthly) of a data warehouse system, and it needs to be agile, automated, and well documented.
How Does ETL Work?

ETL consists of three separate phases:

Extraction

o Extraction is the operation of extracting data from a source system for further use
in a data warehouse environment. This is the first stage of the ETL process.
o The extraction process is often one of the most time-consuming tasks in ETL.
o The source systems might be complicated and poorly documented, and thus determining
which data needs to be extracted can be difficult.
o The data has to be extracted several times in a periodic manner to supply all changed data
to the warehouse and keep it up-to-date.

Cleansing

The cleansing stage is crucial in a data warehouse because it is expected to improve data quality. The primary data cleansing features found in ETL tools are rectification and homogenization. These use specific dictionaries to rectify typing mistakes and to recognize synonyms, as well as rule-based cleansing to enforce domain-specific rules and define appropriate associations between values.
The following examples show the importance of data cleansing:

If an enterprise wishes to contact its users or its suppliers, a complete, accurate and up-to-date
list of contact addresses, email addresses and telephone numbers must be available.

If a client or supplier calls, the staff responding should be able to quickly find the person in the enterprise database, but this requires that the caller's name or his/her company name is listed in the database.

If a user appears in the databases with two or more slightly different names or different account
numbers, it becomes difficult to update the customer's information.
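A minimal pandas sketch of these cleansing steps (the column names and the correction dictionary are illustrative assumptions, not features of any particular ETL tool): rectification via a dictionary, default values for missing elements, and duplicate elimination.

import pandas as pd

# Hypothetical customer records pulled from two source systems
customers = pd.DataFrame({
    "name":  ["Jon Smith", "Jon Smith", "Acme Corp", None],
    "city":  ["Dehli", "Delhi", "Mumbai", "Delhi"],
    "phone": ["123", "123", "456", "789"],
})

# Rectification: fix known typing mistakes using a specific dictionary
corrections = {"Dehli": "Delhi"}
customers["city"] = customers["city"].replace(corrections)

# Provide default values for missing data elements
customers["name"] = customers["name"].fillna("UNKNOWN")

# Eliminate duplicates brought in from different source systems
customers = customers.drop_duplicates()

print(customers)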

Transformation

Transformation is the core of the reconciliation phase. It converts records from their operational source format into a particular data warehouse format. If we implement a three-layer architecture, this phase outputs our reconciled data layer.

The following points must be rectified in this phase:

o Loose texts may hide valuable information. For example, "XYZ PVT Ltd" does not
explicitly show that this is a private limited company.
o Different formats can be used for individual data items. For example, a date can be saved
as a string or as three integers.

Following are the main transformation processes aimed at populating the reconciled data layer:

o Conversion and normalization that operate on both storage formats and units of measure
to make data uniform.
o Matching that associates equivalent fields in different sources.
o Selection that reduces the number of source fields and records.

Cleansing and Transformation processes are often closely linked in ETL tools.
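The following pandas sketch illustrates these three processes under assumed field names and units (nothing here is mandated by a particular ETL tool): conversion and normalization of a unit of measure, matching of equivalent fields, and selection of the fields the warehouse needs.

import pandas as pd

# Two hypothetical sources holding the same fact in different shapes
src_a = pd.DataFrame({"prod": ["P1", "P2"], "weight_lb": [2.2, 11.0]})
src_b = pd.DataFrame({"product_id": ["P3"], "weight_kg": [5.0]})

# Conversion and normalization: one unit of measure (kg) everywhere
src_a["weight_kg"] = src_a["weight_lb"] / 2.20462

# Matching: associate equivalent fields from different sources
src_a = src_a.rename(columns={"prod": "product_id"})

# Selection: keep only the source fields the warehouse needs
cols = ["product_id", "weight_kg"]
reconciled = pd.concat([src_a[cols], src_b[cols]], ignore_index=True)
print(reconciled)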
Loading

The load is the process of writing the data into the target database. During the load step, it is necessary to ensure that the load is performed correctly and with as few resources as possible.

Loading can be carried in two ways:

1. Refresh: Data warehouse data is completely rewritten. This means that older data is
replaced. Refresh is usually used in combination with static extraction to populate a data
warehouse initially.
2. Update: Only those changes applied to source information are added to the Data
Warehouse. An update is typically carried out without deleting or modifying preexisting
data. This method is used in combination with incremental extraction to update data
warehouses regularly.
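A minimal sqlite3 sketch contrasting the two loading modes (the table and rows are hypothetical; a real warehouse load would use the ETL tool's bulk loader):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dw_sales (sale_id INTEGER PRIMARY KEY, amount REAL)")

def refresh(rows):
    # Refresh: completely rewrite the table (paired with static extraction)
    conn.execute("DELETE FROM dw_sales")
    conn.executemany("INSERT INTO dw_sales VALUES (?, ?)", rows)

def update(rows):
    # Update: apply only the changes, preserving preexisting data
    # (paired with incremental extraction); requires SQLite 3.24+ upsert
    conn.executemany(
        "INSERT INTO dw_sales VALUES (?, ?) "
        "ON CONFLICT(sale_id) DO UPDATE SET amount = excluded.amount",
        rows,
    )

refresh([(1, 100.0), (2, 200.0)])  # initial load
update([(2, 250.0), (3, 300.0)])   # periodic incremental load
print(conn.execute("SELECT * FROM dw_sales ORDER BY sale_id").fetchall())
# [(1, 100.0), (2, 250.0), (3, 300.0)]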

Selecting an ETL Tool

Selecting an appropriate ETL tool is an important decision that has to be made for any ODS or data warehousing application. ETL tools are required to provide coordinated access to multiple data sources so that relevant data may be extracted from them. An ETL tool generally contains facilities for data cleansing, re-organization, transformation, aggregation, calculation, and automatic loading of data into the target database.

An ETL tool should provide a simple user interface that allows data cleansing and data
transformation rules to be specified using a point-and-click approach. When all mappings and
transformations have been defined, the ETL tool should automatically generate the data
extract/transformation/load programs, which typically run in batch mode.


Mapping the Data Warehouse Architecture to Multiprocessor Architecture

1. Relational database technology for the data warehouse

Linear speed-up refers to the ability to increase the number of processors in order to reduce response time. Linear scale-up refers to the ability to provide the same performance on the same requests as the database size increases.
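In symbols (one common formalization; the notation is an addition, not from the original notes): if \(T_N\) is the elapsed time for a fixed workload on \(N\) processors, then

\[ \mathrm{Speedup}(N) = \frac{T_1}{T_N}, \]

and speed-up is linear when \(\mathrm{Speedup}(N) = N\). If instead both the workload and the number of processors grow by a factor of \(N\), then

\[ \mathrm{Scaleup}(N) = \frac{T_1(\text{one unit of work})}{T_N(N\ \text{units of work})}, \]

and scale-up is linear when this ratio stays at 1.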

1.1. Types of parallelism

Inter-query parallelism: different server threads or processes handle multiple requests at the same time.

Intra-query parallelism: this form of parallelism decomposes a serial SQL query into lower-level operations such as scan, join, and sort. These lower-level operations are then executed concurrently, in parallel.

Intra-query parallelism can be done in either of two ways:

Horizontal parallelism: the database is partitioned across multiple disks, and parallel processing occurs within a specific task that is performed concurrently on different processors against different sets of data.

Vertical parallelism: this occurs among different tasks. All query components such as scan, join, and sort are executed in parallel in a pipelined fashion; in other words, the output from one task becomes an input to another task.
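A toy Python sketch of horizontal parallelism (the partition contents are made up): the same scan-and-aggregate task runs concurrently on different sets of data, and the partial results are then combined.

from multiprocessing import Pool

def scan_partition(rows):
    # The same scan/aggregate task is applied to each partition's data
    return sum(amount for amount in rows if amount > 100)

if __name__ == "__main__":
    # Horizontal parallelism: the data is partitioned across workers
    partitions = [[50, 150, 200], [300, 80], [120, 500]]
    with Pool(processes=3) as pool:
        partials = pool.map(scan_partition, partitions)
    print(sum(partials))  # combine the partial results: 1270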

1.2 Data partitioning:

Data partitioning is the key component for effective parallel execution of database operations. Partitioning can be done randomly or intelligently.

Random partitioning:

This includes random data striping across multiple disks on a single server. Another option for random partitioning is round-robin partitioning, in which each record is placed on the next disk assigned to the database.
Intelligent partitioning:

This assumes that the DBMS knows where a specific record is located and does not waste time searching for it across all disks. The various intelligent partitioning schemes include:

Hash partitioning: A hash algorithm is used to calculate the partition number based on the value of the partitioning key for each row.

Key range partitioning: Rows are placed and located in the partitions according to the value of the partitioning key; that is, all rows with key values from A to K are in partition 1, L to T are in partition 2, and so on.

Schema partitioning: An entire table is placed on one disk, another table is placed on a different disk, and so on. This is useful for small reference tables.

User-defined partitioning: This allows a table to be partitioned on the basis of a user-defined expression.
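Illustrative Python sketches of three of these placement rules (the partition counts and key ranges are assumptions; a real DBMS computes these internally):

def round_robin_partition(row_index, num_disks):
    # Random/round-robin: each record goes to the next disk in turn
    return row_index % num_disks

def hash_partition(key, num_partitions):
    # Hash: a hash of the partitioning key picks the partition number
    return hash(key) % num_partitions

def key_range_partition(key):
    # Key range: keys A-K -> partition 1, L-T -> partition 2, rest -> 3
    first = key[0].upper()
    if "A" <= first <= "K":
        return 1
    if "L" <= first <= "T":
        return 2
    return 3

print(round_robin_partition(7, 4))    # 3
print(key_range_partition("Mumbai"))  # 2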

2. Database architectures for parallel processing

There are three DBMS software architecture styles for parallel processing:

Shared memory (shared everything) architecture
Shared disk architecture
Shared nothing architecture

2.1 Shared Memory Architecture:


Tightly coupled shared memory systems, illustrated in the following figure, have the following characteristics:

Multiple PUs share memory.
Each PU has full access to all shared memory through a common bus.
Communication between nodes occurs via shared memory.
Performance is limited by the bandwidth of the memory bus.

Shared memory is simple to implement and provides a single system image; an RDBMS can be implemented on an SMP (symmetric multiprocessor) this way.

A disadvantage of shared memory systems for parallel processing is as follows:

Scalability is limited by bus bandwidth and latency, and by available memory.

2.2 Shared Disk Architecture

Shared disk systems are typically loosely coupled. Such systems, illustrated in the following figure, have the following characteristics:

Each node consists of one or more PUs and associated memory.
Memory is not shared between nodes.
Communication occurs over a common high-speed bus.
Each node has access to the same disks and other resources.
A node can be an SMP if the hardware supports it.
The bandwidth of the high-speed bus limits the number of nodes (scalability) of the system.
A Distributed Lock Manager (DLM) is required.
Parallel processing advantages of shared disk systems are as follows:
Shared disk systems permit high availability. All data is accessible even if one node dies.

These systems have the concept of one database, which is an advantage over shared nothing
systems.

Shared disk systems provide for incremental growth.


Parallel processing disadvantages of shared disk systems are these:

Inter-node synchronization is required, involving DLM overhead and greater dependency on the high-speed interconnect.

If the workload is not partitioned well, there may be high synchronization overhead.
2.3 Shared Nothing Architecture

Shared nothing systems are typically loosely coupled. In shared nothing systems, only one CPU is connected to a given disk; if a table or database is located on that disk, access to it depends entirely on the node that owns it.
Shared nothing systems are concerned with access to disks, not access to memory.
Adding more PUs and disks can improve scale-up.
Shared nothing systems have advantages and disadvantages for parallel processing:
Advantages
Shared nothing systems provide for incremental growth.
System growth is practically unlimited.

MPPs are good for read-only databases and decision support applications.
Failure is local: if one node fails, the others stay up.

Disadvantages
More coordination is required.
More overhead is required for a process working on a disk belonging to another node.
If there is a heavy workload of updates or inserts, as in an online transaction processing system, it may be worthwhile to consider data-dependent routing to alleviate contention.

For a DBMS to exploit a shared nothing architecture, several requirements must be met, including:
Support for function shipping

Parallel join strategies

Support for data repartitioning

Query compilation

Support for database transactions

Support for the single system image of the database environment

Combined architecture

Inter-server parallelism: each query is parallelized across multiple servers.
Intra-server parallelism: the query is parallelized within a server.

The combined architecture supports inter-server parallelism of distributed memory MPPs and clusters, and intra-server parallelism of SMP nodes.

3. Parallel DBMS features

Scope and techniques of parallel DBMS operations; optimizer implementation.

Application transparency.

Parallel environment: allows the DBMS server to take full advantage of the existing facilities at a very low level.

DBMS management tools: help to configure, tune, administer, and monitor a parallel RDBMS as effectively as if it were a serial RDBMS.

Price/performance: the parallel RDBMS can demonstrate near-linear speed-up and scale-up at reasonable cost.

4. Alternative technologies
Alternative technologies for improving performance in a data warehouse environment include:

Advanced database indexing products

Multidimensional databases

Specialized RDBMSs

For advanced indexing techniques, refer to Sybase IQ.

5. Parallel DBMS Vendors

Oracle: supports parallel database processing.

Architecture: virtual shared disk capability.

Data partitioning: Oracle 7 supports random striping.

Parallel operations: Oracle may execute all queries serially.

Informix: supports full parallelism.

Architecture: supports shared memory, shared disk, and shared nothing architectures.

Data partitioning: Informix OnLine 7 supports round-robin, schema, hash, and key range partitioning.

Parallel operations: OnLine 7 executes queries, INSERTs, and many utilities in parallel; later releases add parallel UPDATE and DELETE.

IBM: offers a parallel client/server database product, DB2 Parallel Edition.

Architecture: shared nothing architecture.

Data partitioning: hash partitioning.

Parallel operations: INSERT and INDEX CREATION are fully parallelized.

SYBASE: implemented its parallel DBMS functionality in a product called SYBASE MPP (Sybase + NCR).

Architecture: SYBASE MPP is a shared nothing architecture.

Data partitioning: SYBASE MPP supports key range and schema partitioning.

Parallel operations: SYBASE MPP supports horizontal parallelism, but vertical parallelism support is limited.

6. Other RDBMS products

NCR Teradata

Tandem NonStop SQL/MP

7. Specialized database products

Red Brick Systems
What is a Multi-Dimensional Data Model?

A multidimensional model views data in the form of a data-cube. A data cube enables data to be
modeled and viewed in multiple dimensions. It is defined by dimensions and facts.

The dimensions are the perspectives or entities with respect to which an organization keeps records. For example, a shop may create a sales data warehouse to keep records of the store's sales for the dimensions time, item, and location. These dimensions allow the store to keep track of things like monthly sales of items and the locations at which the items were sold. Each dimension has a table related to it, called a dimension table, which describes the dimension further. For example, a dimension table for item may contain the attributes item_name, brand, and type.

A multidimensional data model is organized around a central theme, for example, sales. This
theme is represented by a fact table. Facts are numerical measures. The fact table contains the
names of the facts or measures of the related dimensional tables.
Consider the data of a shop for items sold per quarter in the city of Delhi. The data is shown in the table. In this 2D representation, the sales for Delhi are shown for the time dimension (organized in quarters) and the item dimension (classified according to the type of item sold). The fact or measure displayed is rupees_sold (in thousands).

Now suppose we want to view the sales data with a third dimension. For example, suppose the data according to time and item, as well as location, is considered for the cities Chennai, Kolkata, Mumbai, and Delhi. These 3D data are shown in the table, represented as a series of 2D tables.

Conceptually, it may also be represented by the same data in the form of a 3D data cube, as
shown in fig:
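A small pandas sketch of this cube (the figures are invented; rupees_sold is in thousands, as above). Fixing location = 'Delhi' yields the 2D time-by-item slice, and grouping by location reproduces the 3D cube as a series of 2D tables:

import pandas as pd

# Fact rows: one measure (rupees_sold) keyed by three dimensions
sales = pd.DataFrame({
    "quarter":     ["Q1", "Q1", "Q2", "Q2", "Q1", "Q2"],
    "item":        ["phone", "tv", "phone", "tv", "phone", "tv"],
    "location":    ["Delhi", "Delhi", "Delhi", "Delhi", "Mumbai", "Mumbai"],
    "rupees_sold": [605, 825, 680, 952, 818, 746],
})

# 2D slice of the cube: location fixed to Delhi, time x item
delhi = sales[sales["location"] == "Delhi"]
print(delhi.pivot_table(index="quarter", columns="item",
                        values="rupees_sold", aggfunc="sum"))

# The 3D cube as a series of 2D tables, one per location
print(sales.pivot_table(index=["location", "quarter"], columns="item",
                        values="rupees_sold", aggfunc="sum"))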
Difference between Snowflake Schema and Fact Constellation Schema


Snowflake Schema: The snowflake schema is a type of multidimensional model used for data warehouses. A snowflake schema contains a fact table, dimension tables, and one or more sub-dimension tables for each dimension table. The snowflake schema is a normalized form of the star schema, which reduces redundancy and saves significant storage. Compared with a fact constellation, it is easier to operate because it has fewer joins between the tables, so simpler and less complex queries are used for accessing data from the database.

Advantages:

Reduced data redundancy: The snowflake schema reduces data redundancy by normalizing
dimensions into multiple tables, resulting in a more efficient use of storage space.
Improved performance: The snowflake schema can improve performance for some queries, as the smaller, normalized dimension tables are cheaper to scan and update.
Scalability: The snowflake schema is scalable, making it suitable for large data warehousing
projects with complex hierarchies.

Disadvantages:
Increased complexity: The snowflake schema can be more complex to implement and maintain
due to the additional tables needed for the normalized dimensions.
Reduced query performance: The increased complexity of the snowflake schema can result in
reduced query performance, particularly for queries that require data from multiple dimensions.
Data integrity: It can be more difficult to maintain data integrity in the snowflake schema due to the additional relationships between tables.
Fact Constellation Schema: The fact constellation schema is also a type of multidimensional model. It consists of dimension tables that are shared by several fact tables, and it contains more than one star schema at a time. Unlike the snowflake schema, the fact constellation schema is not easy to operate, as it has multiple joins between tables, and it uses heavily complex queries to access data from the database. Let's see the difference between the snowflake schema and the fact constellation schema:
1. Snowflake schema contains a large central fact table, dimension tables, and sub-dimension tables. In a fact constellation schema, dimension tables are shared by many fact tables.

2. Snowflake schema saves significant storage. Fact constellation schema does not save storage.

3. The snowflake schema consists of one star schema at a time. The fact constellation schema consists of more than one star schema at a time.

4. In a snowflake schema, tables can be maintained easily. In a fact constellation schema, the tables are tough to maintain.

5. Snowflake schema is a normalized form of star schema. Fact constellation schema is a normalized form of snowflake schema and star schema.

6. Snowflake schema is easy to operate compared to fact constellation schema, as it has fewer joins between the tables. Fact constellation schema is not easy to operate compared to snowflake schema, as it has multiple joins between the tables.

7. In a snowflake schema, simple and less complex queries are used to access data from the database. In a fact constellation schema, heavier, more complex queries are used to access data from the database.

Fact Constellation Schema:

Advantages:
Simple to understand: The fact constellation schema is easy to understand and maintain, as it consists of multiple fact tables and multiple dimension tables.
Improved query performance: The fact constellation schema can improve query performance
by reducing the number of joins required to retrieve data from the fact table.
Flexibility: The fact constellation schema is flexible, allowing for the addition of new
dimensions without affecting the existing schema.

Disadvantages:

Increased data redundancy: The fact constellation schema can result in increased data
redundancy due to repeated dimension data across multiple fact tables.
Storage space: The fact constellation schema may require more storage space than the
snowflake schema due to the denormalized dimensions.
Limited scalability: The fact constellation schema may not be as scalable as the snowflake
schema for large data warehousing projects with complex hierarchies.

Star Schema vs Snowflake Schema – Difference Between Them

Key Difference Between Star Schema and Snowflake Schema

• The star schema is the simplest type of data warehouse schema. It is known as a star schema because its structure resembles a star.
• Comparing snowflake vs. star schema, a snowflake schema is an extension of a star schema that adds additional dimension tables. It is called a snowflake because its diagram resembles a snowflake.
• In a star schema, a single join defines the relationship between the fact table and any dimension table.
• A star schema contains a fact table surrounded by dimension tables.
• In a snowflake schema, the fact table is surrounded by dimension tables, which are in turn surrounded by further dimension tables.
• A snowflake schema requires many joins to fetch the data.
• Comparing star vs. snowflake schema, the star schema has a simple DB design, while the snowflake schema has a very complex DB design.

What is a Star Schema?

A star schema in a data warehouse is a model in which the center of the star has one fact table and a number of associated dimension tables. It is known as a star schema because its structure resembles a star. The star schema data model is the simplest type of data warehouse schema. It is also known as the Star Join Schema and is optimized for querying large data sets.

In the following star schema example, the fact table is at the center; it contains keys to every dimension table, like Dealer_ID, Model ID, Date_ID, Product_ID, and Branch_ID, and other attributes like units sold and revenue.
Example of a Star Schema diagram
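A minimal sqlite3 sketch of the star join idea, reduced to one dimension table for brevity (the table and column names are simplifications of the diagram above, not its exact schema):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension table: describes each product
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT);
    -- Fact table: keys to every dimension plus measures (units sold, revenue)
    CREATE TABLE fact_sales (product_id INTEGER, units_sold INTEGER, revenue REAL);

    INSERT INTO dim_product VALUES (1, 'sedan'), (2, 'suv');
    INSERT INTO fact_sales VALUES (1, 3, 60000.0), (2, 2, 90000.0), (1, 1, 20000.0);
""")

# Star join: a single join links the fact table to the dimension table
print(conn.execute("""
    SELECT p.name, SUM(f.units_sold), SUM(f.revenue)
    FROM fact_sales f JOIN dim_product p ON f.product_id = p.product_id
    GROUP BY p.name
""").fetchall())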

What is a Snowflake Schema?


A snowflake schema in a data warehouse is a logical arrangement of tables in a multidimensional database such that the ER diagram resembles a snowflake shape. A snowflake schema is an extension of a star schema that adds additional dimension tables. The dimension tables are normalized, which splits data into additional tables.

In the following Snowflake Schema example, Country is further normalized into an individual
table.
Example of Snowflake Schema
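Continuing the sqlite3 sketch above with hypothetical tables, snowflaking the Country attribute out of the location dimension costs one extra join compared with the star version:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Snowflaking: Country is normalized out of the location dimension
    CREATE TABLE dim_country  (country_id INTEGER PRIMARY KEY, country TEXT);
    CREATE TABLE dim_location (location_id INTEGER PRIMARY KEY,
                               city TEXT, country_id INTEGER);
    CREATE TABLE fact_sales   (location_id INTEGER, revenue REAL);

    INSERT INTO dim_country  VALUES (1, 'India');
    INSERT INTO dim_location VALUES (10, 'Delhi', 1), (11, 'Mumbai', 1);
    INSERT INTO fact_sales   VALUES (10, 500.0), (11, 700.0);
""")

# The normalized dimension needs one more join than the star schema
print(conn.execute("""
    SELECT c.country, SUM(f.revenue)
    FROM fact_sales f
    JOIN dim_location l ON f.location_id = l.location_id
    JOIN dim_country  c ON l.country_id  = c.country_id
    GROUP BY c.country
""").fetchall())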

Difference between Star Schema and Snowflake Schema


Following are the key differences between the star schema and the snowflake schema:

1. Star Schema: Hierarchies for the dimensions are stored in the dimension table. Snowflake Schema: Hierarchies are divided into separate tables.

2. Star Schema: It contains a fact table surrounded by dimension tables. Snowflake Schema: One fact table is surrounded by dimension tables, which are in turn surrounded by further dimension tables.

3. Star Schema: Only a single join creates the relationship between the fact table and any dimension table. Snowflake Schema: Many joins are required to fetch the data.

4. Star Schema: Simple DB design. Snowflake Schema: Very complex DB design.

5. Star Schema: Denormalized data structure; queries also run faster. Snowflake Schema: Normalized data structure.

6. Star Schema: High level of data redundancy. Snowflake Schema: Very low level of data redundancy.

7. Star Schema: A single dimension table contains aggregated data. Snowflake Schema: Data is split into different dimension tables.

8. Star Schema: Cube processing is faster. Snowflake Schema: Cube processing might be slow because of the complex joins.

9. Star Schema: Offers higher-performing queries using star join query optimization; tables may be connected with multiple dimensions. Snowflake Schema: Represented by a centralized fact table which is unlikely to be connected with multiple dimensions.
