DWDM U-1
Unit-1
A data warehouse is a central repository for storing, analyzing, and interpreting
information in order to support better-informed decision-making. An organization's data
warehouse receives data from a variety of sources, typically on a regular basis, including
transactional systems, relational databases, and other sources. As a data management
system, a data warehouse facilitates and supports business intelligence (BI) activities,
specifically analysis. Data warehouses are primarily designed to facilitate searches and
analyses and usually contain large amounts of historical data.
A data warehouse can be defined as a collection of organizational data and information extracted
from operational sources and external data sources. The data is periodically pulled from various
internal applications like sales, marketing, and finance; customer-interface applications; as well
as external partner systems. This data is then made available for decision-makers to access and
analyze. So what is a data warehouse? For a start, it is a comprehensive repository of current
and historical information that is designed to enhance an organization's performance.
Subject-Oriented
A data warehouse is subject-oriented since it provides topic-wise information rather than the
overall processes of a business. Such subjects may be sales, promotion, inventory, etc. For
example, if you want to analyze your company’s sales data, you need to build a data warehouse
that concentrates on sales. Such a warehouse would provide valuable information like ‘who was
your best customer last year?’ or ‘who is likely to be your best customer in the coming year?’
Integrated
A data warehouse is developed by integrating data from varied sources into a consistent format.
The data must be stored in the warehouse in a consistent and universally acceptable manner in
terms of naming, format, and coding. This facilitates effective data analysis.
Non-Volatile
Data once entered into a data warehouse must remain unchanged. All data is read-only. Previous
data is not erased when current data is entered. This helps you to analyze what has happened and
when.
Time-Variant
The data stored in a data warehouse is documented with an element of time, either explicitly or
implicitly. An example of time variance in Data Warehouse is exhibited in the Primary Key,
which must have an element of time like the day, week, or month.
Although a data warehouse and a traditional database share some similarities, they are not
the same concept. The main difference is that in a database, data is collected for multiple
transactional purposes. However, in a data warehouse, data is collected on an extensive scale to
perform analytics. Databases provide real-time data, while warehouses store data to be accessed
for big analytical queries.
Bottom Tier
The bottom tier or data warehouse server usually represents a relational database system. Back-
end tools are used to cleanse, transform and feed data into this layer.
Middle Tier
The middle tier represents an OLAP server that can be implemented in two ways.
The ROLAP (Relational OLAP) model is an extended relational database management system
that maps operations on multidimensional data to standard relational operations.
The MOLAP (Multidimensional OLAP) model acts directly on multidimensional data and operations.
Top Tier
This is the front-end client interface that gets data out from the data warehouse. It holds various
tools like query tools, analysis tools, reporting tools, and data mining tools.
Data Warehousing integrates data and information collected from various sources into one
comprehensive database. For example, a data warehouse might combine customer information
from an organization’s point-of-sale systems, its mailing lists, website, and comment cards. It
might also incorporate confidential information about employees, salary information, etc.
Businesses use such components of data warehouse to analyze customers.
Data mining is one of the features of a data warehouse that involves looking for meaningful data
patterns in vast volumes of data and devising innovative strategies for increased sales and
profits.
Enterprise Data Warehouse (EDW)
This type of warehouse serves as a key or central database that facilitates decision-support
services throughout the enterprise. The advantage to this type of warehouse is that it provides
access to cross-organizational information, offers a unified approach to data representation, and
allows running complex queries.
Operational Data Store (ODS)
This type of data warehouse refreshes in real-time. It is often preferred for routine activities like
storing employee records. It is required when data warehouse systems do not support reporting
needs of the business.
Data Mart
A data mart is a subset of a data warehouse built to maintain a particular department, region, or
business unit. Every department of a business has a central repository or data mart to store data.
The data from the data mart is stored in the ODS periodically. The ODS then sends the data to
the EDW, where it is stored and used.
Let us look at some examples of how companies use data warehouse as an integral part of their
day-to-day operations.
Investment and Insurance companies use data warehouses to primarily analyze customer and
market trends and allied data patterns. In sub-sectors like Forex and stock markets, data
warehouse plays a significant role because a single point difference can result in huge losses
across the board.
Retail chains use data warehouses for marketing and distribution, so they can track items,
examine pricing policies and analyze buying trends of customers. They use data warehouse
models for business intelligence and forecasting needs.
Healthcare companies, on the other hand, use data warehouse concepts to generate treatment
reports, share data with insurance companies and in research and medical units. Healthcare
systems depend heavily upon enterprise data warehouses because they need the latest, updated
treatment information to save lives.
Wondering what data warehouse tools are? Well, these are software components used to perform
several operations on an extensive data set. These tools help to collect, read, write and transfer
data from various sources. What do data warehouses support? They are designed to support
operations like data sorting, filtering, merging, etc.
Data warehouse applications include categories such as OLAP tools.
Some popular data warehouse tools are Xplenty, Amazon Redshift, Teradata, Oracle 12c,
Informatica, IBM Infosphere, Cloudera, and Panoply.
Wondering why businesses need data warehousing? Well, there are several benefits of data
warehouse for end users.
Enabling end-users to ask ad-hoc queries or run reports without degrading the performance of
operational systems
Companies having dedicated Data Warehouse teams emerge ahead of others in key areas of
product development, pricing, marketing, production time, historical analysis, forecasting, and
customer satisfaction. Though data warehouses can be slightly expensive, they pay off in the
long run.
Architecture is the proper arrangement of elements. We build a data warehouse with software
and hardware components. To suit the requirements of our organization, we arrange these
building blocks in a certain way; we may also want to strengthen one part with extra tools and
services. All of this depends on our circumstances.
The figure shows the essential elements of a typical warehouse. The Source Data component
appears on the left. The Data Staging element serves as the next building block. In the
middle is the Data Storage component that manages the data warehouse's data. This element
not only stores and manages the data; it also keeps track of the data using the metadata
repository. The Information Delivery component, shown on the right, consists of all the
different ways of making the information from the data warehouse available to users.
Source data coming into the data warehouses may be grouped into four broad categories:
Production Data: This type of data comes from the different operational systems of the
enterprise. Based on the data requirements in the data warehouse, we choose segments of data
from the various operational systems.
Internal Data: In each organization, users keep their "private" spreadsheets, reports,
customer profiles, and sometimes even departmental databases. This is the internal data, part
of which could be useful in a data warehouse.
Archived Data: Operational systems are mainly intended to run the current business. In every
operational system, we periodically take the old data and store it in archived files.
External Data: Most executives depend on information from external sources for a large
percentage of the information they use. They use statistics relating to their industry
produced by external agencies.
After we have extracted data from various operational systems and external sources, we have
to prepare it for storage in the data warehouse. The extracted data coming from several
different sources must be changed, converted, and made ready in a format suitable for saving,
querying, and analysis.
1) Data Extraction: This method has to deal with numerous data sources. We have to employ
the appropriate techniques for each data source.
2) Data Transformation: As we know, data for a data warehouse comes from many different
sources. If data extraction for a data warehouse poses big challenges, data transformation
presents even more significant challenges. We perform several individual tasks as part of
data transformation.
First, we clean the data extracted from each source. Cleaning may be the correction of
misspellings or may deal with providing default values for missing data elements, or elimination
of duplicates when we bring in the same data from various source systems.
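To make this concrete, here is a minimal Python sketch of the two cleaning tasks just mentioned: providing default values for missing data elements and eliminating duplicates. All field names and sample rows are invented for illustration; a real ETL tool does far more.

def clean(records, defaults, key_field):
    """Fill missing fields with defaults, then drop duplicate records."""
    seen = set()
    cleaned = []
    for rec in records:
        # Provide default values for missing data elements.
        for field, default in defaults.items():
            if not rec.get(field):
                rec[field] = default
        # Eliminate duplicates brought in from multiple source systems:
        # keep the first record seen for each key.
        if rec[key_field] not in seen:
            seen.add(rec[key_field])
            cleaned.append(rec)
    return cleaned

rows = [
    {"cust_id": 1, "city": "Delhi", "segment": None},
    {"cust_id": 1, "city": "Delhi", "segment": "retail"},  # duplicate key
    {"cust_id": 2, "city": None, "segment": "retail"},
]
print(clean(rows, defaults={"city": "UNKNOWN", "segment": "UNKNOWN"},
            key_field="cust_id"))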
3) Data Loading: Two distinct categories of tasks form data loading functions. When we
complete the structure and construction of the data warehouse and go live for the first time, we
do the initial loading of the information into the data warehouse storage. The initial load moves
high volumes of data using up a substantial amount of time.
Data storage for the data warehouse is a separate repository. The data repositories for the
operational systems generally include only current data. Also, these repositories contain
data structured in highly normalized form for fast and efficient processing.
The information delivery element is used to enable the process of subscribing to data
warehouse data and having it delivered to one or more destinations according to some
customer-specified scheduling algorithm.
Metadata Component
Metadata in a data warehouse is similar to the data dictionary or the data catalog in a database
management system. In the data dictionary, we keep the data about the logical data structures,
the data about the records and addresses, the information about the indexes, and so on.
Data Marts
It includes a subset of corporate-wide data that is of value to a specific group of users. The
scope is confined to particular selected subjects. Data in a data warehouse should be fairly
current, but not necessarily up to the minute, although developments in the data warehouse
industry have made standard and incremental data loads more achievable. Data marts are
smaller than data warehouses and usually contain data relevant to a single part of the
organization. The current trend in data warehousing is to develop a data warehouse with
several smaller related data marts for particular kinds of queries and reports.
The management and control elements coordinate the services and functions within the data
warehouse. These components control the data transformation and the data transfer into the
data warehouse storage. They also moderate data delivery to the clients, work with the
database management systems, and ensure that data is correctly saved in the repositories.
They monitor the movement of data into the staging area and from there into the data
warehouse storage itself.
Data Warehouse queries are complex because they involve the computation of large groups of
data at summarized levels.
It may require the use of distinctive data organization, access, and implementation methods
based on multidimensional views.
Performing OLAP queries in an operational database degrades the performance of functional tasks.
A data warehouse is used for analysis and decision making, for which an extensive database is
required, including historical data, which an operational database does not typically maintain.
The separation of an operational database from data warehouses is based on the different
structures and uses of data in these systems.
Because the two systems provide different functionalities and require different kinds of data, it is
necessary to maintain separate databases.
Database vs. Data Warehouse:
1. Database: It is used for Online Transactional Processing (OLTP) but can be used for other
objectives such as data warehousing; it records data from clients as history. Data Warehouse:
It is used for Online Analytical Processing (OLAP); it reads historical information about
customers for business decisions.
2. Database: The tables and joins are complicated since they are normalized for the RDBMS;
this reduces redundant data and saves storage space. Data Warehouse: The tables and joins are
simple since they are de-normalized; this minimizes the response time for analytical queries.
3. Database: Data is dynamic. Data Warehouse: Data is largely static.
4. Database: Entity-relationship modeling techniques are used for database design. Data
Warehouse: Data modeling approaches are used for data warehouse design.
5. Database: Performance is low for analysis queries. Data Warehouse: High performance for
analytical queries.
6. Database: The database is the place where data is stored and managed for fast and
efficient access. Data Warehouse: The data warehouse is the place where application data is
handled for analysis and reporting purposes.
The Operational Database is the source of information for the data warehouse. It includes
detailed information used to run the day to day operations of the business. The data frequently
changes as updates are made and reflect the current value of the last transactions.
Data Warehouse Systems serve users or knowledge workers in the purpose of data analysis and
decision-making. Such systems can organize and present information in specific formats to
accommodate the diverse needs of various users. These systems are called Online Analytical
Processing (OLAP) systems.
Data Warehouse and the OLTP database are both relational databases. However, the goals of
both these databases are different.
Operational systems vs. data warehousing systems:
Operational systems are designed to support high-volume transaction processing; data
warehousing systems are typically designed to support high-volume analytical processing
(i.e., OLAP).
Operational systems are usually concerned with current data; data warehousing systems are
usually concerned with historical data.
Data within operational systems are mainly updated regularly according to need; data in a
warehouse is non-volatile: new data may be added regularly, but once added it is rarely
changed.
Operational systems are designed for real-time business dealings and processes; data
warehousing systems are designed for analysis of business measures by subject area,
categories, and attributes.
Operational systems are optimized for a simple set of transactions, generally adding or
retrieving a single row at a time per table; data warehousing systems are optimized for bulk
loads and large, complex, unpredictable queries that access many rows per table.
Operational systems are optimized for validation of incoming information during transactions
and use validation data tables; data warehouses are loaded with consistent, valid information
and require no real-time validation.
Operational systems are widely process-oriented; data warehousing systems are widely
subject-oriented.
Operational systems are usually optimized to perform fast inserts and updates of relatively
small volumes of data; data warehousing systems are usually optimized to perform fast
retrievals of relatively high volumes of data.
Relational databases are created for Online Transaction Processing (OLTP); data warehouses
are designed for Online Analytical Processing (OLAP).
OLTP System
OLTP systems handle operational data. Operational data are those data involved in the
operation of a particular system. Examples: ATM transactions, bank transactions, etc.
OLAP System
OLAP systems handle historical or archival data. Historical data are those data that are
archived over a long period. For example, if we collect the last 10 years' information about
flight reservations, the data can give us much meaningful information, such as trends in
reservations. This may provide useful information like peak travel times or what kind of
people are traveling in various classes (Economy/Business), etc.
The major difference between an OLTP and an OLAP system is the amount of data analyzed in a
single transaction. An OLTP system handles many concurrent users and queries touching only an
individual record or limited groups of records at a time, whereas an OLAP system must have
the capability to operate on millions of records to answer a single query.
OLTP vs. OLAP:
Users - OLTP: clerks, clients, and information technology professionals. OLAP: knowledge
workers, including managers, executives, and analysts.
Data contents - OLTP: manages current data that, typically, are too detailed to be easily
used for decision making. OLAP: manages a large amount of historical data, provides
facilities for summarization and aggregation, and stores and manages data at different levels
of granularity; this makes the data easier to use in informed decision making.
Database size - OLTP: 100 MB to GB. OLAP: 100 GB to TB.
Database design - OLTP: usually uses an entity-relationship (ER) data model and an
application-oriented database design. OLAP: typically uses either a star or snowflake model
and a subject-oriented database design.
View - OLTP: focuses primarily on the current data within an enterprise or department,
without referring to historical data or data in different organizations. OLAP: often spans
multiple versions of a database schema, due to the evolutionary process of an organization,
and deals with data that originates from various organizations, integrating information from
many data stores.
Volume of data - OLTP: not very large. OLAP: because of their large volume, OLAP data are
stored on multiple storage media.
Access patterns - OLTP: consists mainly of short, atomic transactions; such a system requires
concurrency control and recovery techniques. OLAP: accesses are mostly read-only operations
because these data warehouses store historical data.
Inserts and updates - OLTP: short and fast inserts and updates initiated by end users. OLAP:
periodic long-running batch jobs refresh the data.
Production applications such as payroll, accounts payable, product purchasing, and inventory
control are designed for online transaction processing (OLTP). Such applications gather
detailed data from day-to-day operations.
Data Warehouse applications are designed to support the user ad-hoc data requirements, an
activity recently dubbed online analytical processing (OLAP). These include applications such as
forecasting, profiling, summary reporting, and trend analysis.
Production databases are updated continuously, either by hand or via OLTP applications. In
contrast, a warehouse database is updated from operational systems periodically, usually
during off-hours. As OLTP data accumulates in production databases, it is regularly
extracted, filtered, and then loaded into a dedicated warehouse server that is accessible to
users. As the warehouse is populated, it must be restructured: tables are de-normalized, data
is cleansed of errors and redundancies, and new fields and keys are added to reflect the
users' needs for sorting, combining, and summarizing data.
Data warehouses and their architectures vary depending upon the elements of an organization's
situation.
Operational System
An operational system is a method used in data warehousing to refer to a system that is used to
process the day-to-day transactions of an organization.
Flat Files
A Flat file system is a system of files in which transactional data is stored, and every file in the
system must have a different name.
Meta Data
A set of data that defines and gives information about other data.
Meta data summarizes necessary information about data, which can make finding and working
with particular instances of data easier. For example, author, date created, date modified,
and file size are examples of very basic document metadata.
This area of the data warehouse stores all the predefined lightly and highly summarized
(aggregated) data generated by the warehouse manager.
The goal of the summarized information is to speed up query performance. The summarized
record is updated continuously as new information is loaded into the warehouse.
The principal purpose of a data warehouse is to provide information to the business managers for
strategic decision-making. These customers interact with the warehouse using end-client access
tools.
We can do this programmatically, although most data warehouses use a staging area (a place
where data is processed before entering the warehouse) instead.
A staging area simplifies data cleansing and consolidation for operational data coming from
multiple source systems, especially for enterprise data warehouses where all relevant data of
an enterprise is consolidated.
Data Warehouse Staging Area is a temporary location where a record from source systems is
copied.
We may want to customize our warehouse's architecture for multiple groups within our
organization.
We can do this by adding data marts. A data mart is a segment of a data warehouse that can
provide information for reporting and analysis on a section, unit, department, or operation
in the company, e.g., sales, payroll, production, etc.
The figure illustrates an example where purchasing, sales, and stocks are separated. In this
example, a financial analyst wants to analyze historical data for purchases and sales or mine
historical information to make predictions about customer behavior.
The following architecture properties are necessary for a data warehouse system:
1. Separation: Analytical and transactional processing should be kept apart as much as possible.
2. Scalability: Hardware and software architectures should be simple to upgrade as the data
volume that has to be managed and processed, and the number of users' requirements that have
to be met, progressively increase.
3. Extensibility: The architecture should be able to perform new operations and technologies
without redesigning the whole system.
4. Security: Monitoring accesses is necessary because of the strategic data stored in the
data warehouse.
Single-Tier Architecture
Single-tier architecture is not frequently used in practice. Its purpose is to minimize the
amount of data stored; to reach this goal, it removes data redundancies.
As the figure shows, the only layer physically available is the source layer. In this method,
data warehouses are virtual. This means that the data warehouse is implemented as a
multidimensional view of operational data created by specific middleware, or an intermediate
processing layer.
The weakness of this architecture lies in its failure to meet the requirement for separation
between analytical and transactional processing. Analysis queries are submitted to
operational data after the middleware interprets them. In this way, queries affect
transactional workloads.
Two-Tier Architecture
The requirement for separation plays an essential role in defining the two-tier architecture
for a data warehouse system, as shown in the figure.
Three-Tier Architecture
The three-tier architecture consists of the source layer (containing multiple source
systems), the reconciled layer, and the data warehouse layer (containing both data warehouses
and data marts). The reconciled layer sits between the source data and the data warehouse.
The main advantage of the reconciled layer is that it creates a standard reference data model for
a whole enterprise. At the same time, it separates the problems of source data extraction and
integration from those of data warehouse population. In some cases, the reconciled layer is
also used directly to better accomplish some operational tasks, such as producing daily
reports that cannot be satisfactorily prepared using the corporate applications, or
generating data flows to feed external processes periodically so as to benefit from cleaning
and integration.
This architecture is especially useful for extensive, enterprise-wide systems. A disadvantage
of this structure is the extra storage space used by the redundant reconciled layer. It also
moves the analytical tools a little further away from being real-time.
A middle tier consists of an OLAP server for fast querying of the data warehouse.
(1) A Relational OLAP (ROLAP) model, i.e., an extended relational DBMS that maps
functions on multidimensional data to standard relational operations.
(2) A Multidimensional OLAP (MOLAP) model, i.e., a special-purpose server that directly
implements multidimensional data and operations.
A top tier contains front-end tools for displaying results provided by OLAP, as well as
additional tools for data mining of the OLAP-generated data.
Load Performance
Data warehouses require incremental loading of new data on a periodic basis within narrow
time windows; performance of the load process should be measured in hundreds of millions of
rows and gigabytes per hour and must not artificially constrain the volume of data the
business requires.
Load Processing
Many steps must be taken to load new or updated data into the data warehouse, including data
conversion, filtering, reformatting, indexing, and metadata updates.
Data Quality Management
Fact-based management demands the highest data quality. The warehouse must ensure local
consistency, global consistency, and referential integrity despite "dirty" sources and
massive database size.
Query Performance
Fact-based management must not be slowed by the performance of the data warehouse RDBMS;
large, complex queries must complete in seconds, not days.
Terabyte Scalability
Data warehouse sizes are growing at astonishing rates. Today these range from a few gigabytes
to hundreds of gigabytes, and terabyte-sized data warehouses are becoming common.
What is ETL?
The mechanism of extracting information from source systems and bringing it into the data
warehouse is commonly called ETL, which stands for Extraction, Transformation and
Loading.
The ETL process requires active inputs from various stakeholders, including developers,
analysts, testers, and top executives, and is technically challenging.
To maintain its value as a tool for decision-makers, a data warehouse needs to change with
the business. ETL is a recurring activity (daily, weekly, monthly) of a data warehouse system
and needs to be agile, automated, and well documented.
How ETL Works?
Extraction
o Extraction is the operation of extracting information from a source system for further use
in a data warehouse environment. This is the first stage of the ETL process.
o The extraction process is often one of the most time-consuming tasks in ETL.
o The source systems might be complicated and poorly documented, and thus determining
which data needs to be extracted can be difficult.
o The data has to be extracted several times in a periodic manner to supply all changed data
to the warehouse and keep it up-to-date.
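As a rough illustration of periodic (incremental) extraction, the Python sketch below pulls only the rows that changed since the previous run. The table layout, column names, and timestamps are invented for this example.

from datetime import datetime, timezone

# Persisted between runs in a real system; hard-coded here for illustration.
last_extracted = datetime(2024, 1, 1, tzinfo=timezone.utc)

def extract_changed(source_rows, since):
    """Return only the rows modified after the previous extraction."""
    return [row for row in source_rows if row["last_modified"] > since]

source_rows = [
    {"order_id": 101, "amount": 250.0,
     "last_modified": datetime(2023, 12, 30, tzinfo=timezone.utc)},
    {"order_id": 102, "amount": 90.0,
     "last_modified": datetime(2024, 1, 5, tzinfo=timezone.utc)},
]
# Only order 102 is supplied to the warehouse in this cycle.
print(extract_changed(source_rows, last_extracted))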
Cleansing
The cleansing stage is crucial in a data warehouse because it is supposed to improve data
quality. The primary data cleansing features found in ETL tools are rectification and
homogenization. They use specific dictionaries to rectify typing mistakes and to recognize
synonyms, as well as rule-based cleansing to enforce domain-specific rules and define
appropriate associations between values.
The following examples show why data cleansing is essential:
If an enterprise wishes to contact its users or its suppliers, a complete, accurate and up-to-date
list of contact addresses, email addresses and telephone numbers must be available.
If a client or supplier calls, the staff responding should be able to quickly find the person
in the enterprise database, but this requires that the caller's name or his/her company name
is listed in the database.
If a user appears in the databases with two or more slightly different names or different account
numbers, it becomes difficult to update the customer's information.
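A toy Python sketch of dictionary- and rule-based cleansing follows: rectification fixes typing mistakes, and homogenization maps synonyms and abbreviations to one standard form so the same customer is not listed twice. The dictionaries are invented examples, not from any real ETL tool.

# Dictionary used to rectify common typing mistakes.
TYPO_FIXES = {"Dehli": "Delhi", "Mumbay": "Mumbai"}
# Dictionary used to homogenize synonyms / abbreviations.
SYNONYMS = {"St.": "Street", "Rd.": "Road"}

def rectify(value):
    value = TYPO_FIXES.get(value, value)
    for short, full in SYNONYMS.items():
        value = value.replace(short, full)
    return value

print(rectify("Dehli"))      # -> Delhi
print(rectify("12 MG Rd."))  # -> 12 MG Road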
Transformation
Transformation is the core of the reconciliation phase. It converts records from their
operational source format into a particular data warehouse format. If we implement a
three-layer architecture, this phase outputs our reconciled data layer.
o Loose text may hide valuable information. For example, "XYZ PVT Ltd" does not
explicitly show that this is a private limited company.
o Different formats can be used for individual data. For example, a date can be saved as a
string or as three integers.
Following are the main transformation processes aimed at populating the reconciled data layer:
o Conversion and normalization that operate on both storage formats and units of measure
to make data uniform.
o Matching that associates equivalent fields in different sources.
o Selection that reduces the number of source fields and records.
Cleansing and Transformation processes are often closely linked in ETL tools.
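The short Python sketch below walks through the three reconciliation steps named above on invented records: conversion/normalization (standardizing units), matching (treating equivalent fields from two sources as one), and selection (not carrying internal fields over).

def transform(source_a_row, source_b_row):
    return {
        # Matching: 'cust_no' in source A and 'customer_id' in source B
        # are recognized as the same field.
        "customer_id": source_a_row["cust_no"] or source_b_row["customer_id"],
        # Conversion/normalization: source A stores amounts in paise;
        # the warehouse standardizes on rupees.
        "amount_rupees": source_a_row["amount_paise"] / 100,
        # Selection: internal flags from the sources are simply not carried over.
    }

a = {"cust_no": 7, "amount_paise": 125000, "internal_flag": "x"}
b = {"customer_id": 7, "amount": 1250.0, "audit_note": "y"}
print(transform(a, b))  # {'customer_id': 7, 'amount_rupees': 1250.0}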
Loading
The load is the process of writing the data into the target database. During the load step,
it is necessary to ensure that the load is performed correctly and using as few resources as
possible.
1. Refresh: Data warehouse data is completely rewritten. This means that the older data is
replaced. Refresh is usually used in combination with static extraction to populate a data
warehouse initially.
2. Update: Only those changes applied to source information are added to the Data
Warehouse. An update is typically carried out without deleting or modifying preexisting
data. This method is used in combination with incremental extraction to update data
warehouses regularly.
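The difference between the two load modes can be sketched in a few lines of Python, using an in-memory dict keyed by primary key as a stand-in for warehouse storage (purely illustrative):

warehouse = {}

def refresh(target, extracted_rows):
    """Refresh: completely rewrite the warehouse data (initial population)."""
    target.clear()
    target.update({row["id"]: row for row in extracted_rows})

def update(target, changed_rows):
    """Update: apply only the changes; pre-existing rows are kept.
    (A warehouse that keeps full history would append a time-stamped
    version instead of overwriting a changed row.)"""
    for row in changed_rows:
        target[row["id"]] = row

refresh(warehouse, [{"id": 1, "sales": 100}, {"id": 2, "sales": 200}])
update(warehouse, [{"id": 2, "sales": 250}, {"id": 3, "sales": 50}])
print(warehouse)  # rows 1, 2 (changed) and 3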
Selection of an appropriate ETL tool is an important decision that has to be made when
implementing an ODS or data warehousing application. ETL tools are required to provide
coordinated access to multiple data sources so that relevant data may be extracted from them.
An ETL tool generally contains tools for data cleansing, reorganization, transformation,
aggregation, calculation, and automatic loading of data into the target database.
An ETL tool should provide a simple user interface that allows data cleansing and data
transformation rules to be specified using a point-and-click approach. When all mappings and
transformations have been defined, the ETL tool should automatically generate the data
extract/transformation/load programs, which typically run in batch mode.
Linear speed-up refers to the ability to increase the number of processors in order to reduce
response time. Linear scale-up refers to the ability to provide the same performance on the
same request as the database size increases.
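In symbols (notation added here, not from the original notes): if T(1) is the response time of a fixed workload on one processor and T(n) the time on n processors, then speed-up(n) = T(1) / T(n), and the speed-up is linear when it equals n. Likewise, if the database grows in proportion to the number of processors, scale-up(n) = T(1 processor, 1x database) / T(n processors, nx database), and the scale-up is linear when this ratio stays at 1.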
Inter-query parallelism: different server threads or processes handle multiple requests at
the same time.
Intra-query parallelism: this form of parallelism decomposes the serial SQL query into
lower-level operations such as scan, join, and sort. These lower-level operations are then
executed concurrently, in parallel.
Vertical parallelism: this occurs among different tasks. All query components such as scan,
join, sort, etc. are executed in parallel in a pipelined fashion. In other words, the output
from one task becomes an input into another task.
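A toy Python illustration of intra-query parallelism: a serial aggregation (the sum of one column) is decomposed into partial scans that run concurrently on partitions of the table. This is only a sketch; a real DBMS performs this decomposition inside the engine, not in application code.

from multiprocessing import Pool

def partial_sum(rows):
    """One low-level scan operation over one partition of the table."""
    return sum(r["sales"] for r in rows)

if __name__ == "__main__":
    table = [{"sales": i} for i in range(100_000)]
    partitions = [table[i::4] for i in range(4)]  # 4-way horizontal split
    with Pool(4) as pool:
        # The four partition scans execute concurrently, in parallel.
        total = sum(pool.map(partial_sum, partitions))
    print(total)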
Data partitioning is the key component for effective parallel execution of database
operations. Partitioning can be done randomly or intelligently.
Random partitioning includes random data striping across multiple disks or round-robin
partitioning. Intelligent partitioning assumes that the DBMS knows where a specific record is
located and does not waste time searching for it across all disks. The various intelligent
partitioning schemes include:
Hash partitioning: A hash algorithm is used to calculate the partition number based on the value
of the partitioning key for each row
Key range partitioning: Rows are placed and located in the partitions according to the value of
the partitioning key. That is all the rows with the key value from A to K are in partition 1, L to T
are in partition 2 and so on.
Schema partitioning: an entire table is placed on one disk; another table is placed on a
different disk, etc. This is useful for small reference tables.
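The sketch below shows, in Python, how hash and key-range partitioning might assign a row to a partition. The partition counts and key ranges are arbitrary examples mirroring the text above.

import zlib

def hash_partition(row, key, n_partitions):
    """Hash partitioning: a hash of the partitioning key picks the partition."""
    return zlib.crc32(str(row[key]).encode()) % n_partitions

def key_range_partition(row, key):
    """Key-range partitioning: rows with keys A-K go to partition 1,
    L-T to partition 2, and everything else to partition 3."""
    first = row[key][0].upper()
    if "A" <= first <= "K":
        return 1
    if "L" <= first <= "T":
        return 2
    return 3

row = {"customer": "Kumar", "sales": 10}
print(hash_partition(row, "customer", 4))
print(key_range_partition(row, "customer"))  # -> 1 ('K' falls in A-K)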
Shared Disk Architecture
Shared disk systems are typically loosely coupled. Such systems, illustrated in the following
figure, have the following characteristics:
These systems have the concept of one database, which is an advantage over shared nothing
systems.
If the workload is not partitioned well, there may be high synchronization overhead.
2.3 Shared Nothing Architecture
Shared nothing systems are typically loosely coupled. In shared nothing systems, only one CPU
is connected to a given disk. If a table or database is located on that disk, access depends
entirely on the PU which owns it.
Shared nothing systems are concerned with access to disks, not access to memory.
Adding more PUs and disks can improve scale up.
Shared nothing systems have advantages and disadvantages for parallel processing:
Advantages
Shared nothing systems provide for incremental growth.
System growth is practically unlimited.
MPPs are good for read-only databases and decision support applications.
Failure is local: if one node fails, the others stay up.
Disadvantages
More coordination is required.
More overhead is required for a process working on a disk belonging to another node.
If there is a heavy workload of updates or inserts, as in an online transaction processing system,
it may be worthwhile to consider data-dependent routing to alleviate contention.
These requirements include:
Support for function shipping
Query compilation
Combined architecture
Interserver parallelism: each query is parallelized across multiple servers.
Intraserver parallelism: the query is parallelized within a server.
The combined architecture supports interserver parallelism across distributed-memory MPP and
cluster nodes, and intraserver parallelism within SMP nodes.
Application transparency
Parallel environment: allows the DBMS server to take full advantage of the existing
facilities on a very low level.
DBMS management tools: help to configure, tune, administer, and monitor a parallel RDBMS as
effectively as if it were a serial RDBMS.
Price/Performance: the parallel RDBMS can demonstrate a near-linear speed-up and scale-up at
reasonable costs.
4. Alternative technologies
Alternative technologies for improving performance in a data warehouse environment include:
Multidimensional databases
Specialized RDBMSs
Advanced indexing techniques (see Sybase IQ)
Data partitioning: Informix OnLine 7 supports round-robin, schema, hash, and key-range
partitioning.
Parallel operations: OnLine 7 executes queries, INSERT operations, and many utilities in
parallel; later releases add parallel UPDATE and DELETE.
SYBASE: Sybase implemented its parallel DBMS functionality in a product called SYBASE MPP
(developed jointly by Sybase and NCR).
Architecture: SYBASE MPP uses a shared-nothing architecture.
A multidimensional model views data in the form of a data-cube. A data cube enables data to be
modeled and viewed in multiple dimensions. It is defined by dimensions and facts.
The dimensions are the perspectives or entities concerning which an organization keeps
records. For example, a shop may create a sales data warehouse to keep records of the store's
sales for the dimensions time, item, and location. These dimensions allow the store to keep
track of things like monthly sales of items and the locations at which the items were sold.
Each dimension has a table related to it, called a dimension table, which describes the
dimension further. For example, a dimension table for an item may contain the attributes
item_name, brand, and type.
A multidimensional data model is organized around a central theme, for example, sales. This
theme is represented by a fact table. Facts are numerical measures. The fact table contains the
names of the facts or measures of the related dimensional tables.
Consider the data of a shop for items sold per quarter in the city of Delhi. The data is shown in
the table. In this 2D representation, the sales for Delhi are shown for the time dimension
(organized in quarters) and the item dimension (classified according to the types of an item sold).
The fact or measure displayed is rupee_sold (in thousands).
Now suppose we want to view the sales data with a third dimension. For example, suppose we
consider the data according to time and item, as well as location, for the cities Chennai,
Kolkata, Mumbai, and Delhi. These 3D data are shown in the table, represented as a series of
2D tables.
Conceptually, the same data may also be represented in the form of a 3D data cube, as shown
in the figure.
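A small Python sketch of this idea: the cube is a mapping keyed by (time, item, location), with rupee_sold as the measure, and slicing on the location dimension recovers one of the 2D tables. All figures are invented.

from collections import defaultdict

cube = defaultdict(int)
sales_records = [
    ("Q1", "phone",  "Delhi",   605),
    ("Q1", "phone",  "Chennai", 330),
    ("Q2", "laptop", "Delhi",   512),
    ("Q2", "phone",  "Delhi",   680),
]
for quarter, item, city, rupees in sales_records:
    cube[(quarter, item, city)] += rupees

# Slicing on location = 'Delhi' yields a 2D (time x item) view, just as
# the 3D table is presented in the text as a series of 2D tables.
delhi_view = {(q, i): v for (q, i, c), v in cube.items() if c == "Delhi"}
print(delhi_view)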
Difference between Snowflake Schema and Fact Constellation Schema
Snowflake Schema: The snowflake schema is a type of multidimensional model used for data
warehouses. A snowflake schema contains the fact table, dimension tables, and one or more
additional tables for each dimension table. It is a normalized form of the star schema, which
reduces redundancy and saves significant storage. Compared with the fact constellation
schema, it has fewer joins between tables, so simpler and less complex queries are used to
access data from the database.
Advantages:
Reduced data redundancy: The snowflake schema reduces data redundancy by normalizing
dimensions into multiple tables, resulting in a more efficient use of storage space.
Improved performance for some operations: the smaller, normalized dimension tables can make
certain queries and maintenance operations faster.
Scalability: The snowflake schema is scalable, making it suitable for large data warehousing
projects with complex hierarchies.
Disadvantages:
Increased complexity: The snowflake schema can be more complex to implement and maintain
due to the additional tables needed for the normalized dimensions.
Reduced query performance: The increased complexity of the snowflake schema can result in
reduced query performance, particularly for queries that require data from multiple dimensions.
Data integrity: Data integrity can be more difficult to maintain in the snowflake schema due
to the additional relationships between tables.
Fact Constellation Schema: The fact constellation schema is also a type of multidimensional
model. It consists of dimension tables that are shared by several fact tables, and it
contains more than one star schema at a time. Unlike the snowflake schema, the fact
constellation schema is not as easy to operate, as it has a larger number of joins between
tables, and it uses heavily complex queries to access data from the database. Let's see the
difference between the snowflake schema and the fact constellation schema:
Snowflake Schema vs. Fact Constellation:
The snowflake schema saves significant storage, while the fact constellation schema does not.
In the snowflake schema, tables can be maintained easily; in the fact constellation schema,
the tables are tough to maintain.
Advantages:
Simple to understand: The fact constellation schema is easy to understand and maintain, as it
consists of multiple fact tables and multiple shared dimension tables.
Improved query performance: The fact constellation schema can improve query performance
by reducing the number of joins required to retrieve data from the fact table.
Flexibility: The fact constellation schema is flexible, allowing for the addition of new
dimensions without affecting the existing schema.
Disadvantages:
Increased data redundancy: The fact constellation schema can result in increased data
redundancy due to repeated dimension data across multiple fact tables.
Storage space: The fact constellation schema may require more storage space than the
snowflake schema due to the denormalized dimensions.
Limited scalability: The fact constellation schema may not be as scalable as the snowflake
schema for large data warehousing projects with complex hierarchies.
The star schema is the simplest type of Data Warehouse schema. It is known as star
schema as its structure resembles a star.
Comparing Snowflake vs Star schema, a Snowflake Schema is an extension of a Star
Schema, and it adds additional dimensions. It is called snowflake because its diagram
resembles a Snowflake.
In a star schema, only a single join defines the relationship between the fact table and any
dimension table.
A star schema contains a fact table surrounded by dimension tables.
In a snowflake schema, the fact table is surrounded by dimension tables, which are in turn
surrounded by further dimension tables.
A snowflake schema requires many joins to fetch the data.
Comparing star vs. snowflake schema, the star schema has a simple DB design, while the
snowflake schema has a very complex DB design.
In a star schema in a data warehouse, the center of the star can have one fact table and a
number of associated dimension tables. It is known as a star schema because its structure
resembles a star. The star schema data model is the simplest type of data warehouse schema.
It is also known as the Star Join Schema and is optimized for querying large data sets.
In the following Star Schema example, the fact table is at the center which contains keys to every
dimension table like Dealer_ID, Model ID, Date_ID, Product_ID, Branch_ID & other attributes
like Units sold and revenue.
Example of Star Schema Diagram
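To see why a star query needs only one join per dimension, here is a minimal Python model of the example above: the fact table holds foreign keys plus measures, and each dimension table is looked up directly by its key. Table contents are invented.

dealer_dim = {1: {"dealer_name": "North Motors"}}
product_dim = {10: {"product_name": "Sedan"}}

fact_sales = [
    {"Dealer_ID": 1, "Product_ID": 10, "units_sold": 5, "revenue": 75000},
]

# Each fact row reaches any dimension attribute through a single lookup (join).
for row in fact_sales:
    print(dealer_dim[row["Dealer_ID"]]["dealer_name"],
          product_dim[row["Product_ID"]]["product_name"],
          row["units_sold"], row["revenue"])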
In the following Snowflake Schema example, Country is further normalized into an individual
table.
Example of Snowflake Schema
In a star schema, a single dimension table contains aggregated data, whereas in a snowflake
schema, data is split into different dimension tables.