0% found this document useful (0 votes)
59 views16 pages

Dwbi Unit-1

The document provides an overview of data warehousing and business intelligence, defining key concepts such as data, data warehouses, and their benefits. It distinguishes between operational databases and data warehouses, highlighting their different functions, structures, and user requirements. Additionally, it covers OLAP operations, data cube classifications, and modeling techniques essential for effective data analysis and decision-making.

Uploaded by

yashusai567
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
59 views16 pages

Dwbi Unit-1

The document provides an overview of data warehousing and business intelligence, defining key concepts such as data, data warehouses, and their benefits. It distinguishes between operational databases and data warehouses, highlighting their different functions, structures, and user requirements. Additionally, it covers OLAP operations, data cube classifications, and modeling techniques essential for effective data analysis and decision-making.

Uploaded by

yashusai567
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 16

R-22

DATA WAREHOUSING AND BUSINESS INTELLIGENCE


UNIT-I

UNIT - I : Data Warehouse, Data Warehouse Modelling, OLAP operations, Data Cube Computation
methods.

What is Data?

Data is a collection of information gathered by observations, measurements, research or analysis. They


may consist of facts, numbers, names, figures or even description of things. Data is organized in the form
of graphs, charts or tables. There exist data scientist who does data mining and with the help of that data
analyse our world.

You must have watched the weather forecast reports on news channels. They list the minimum
temperature, the maximum temperature, rainfall predictions and measurements. The tabular
representation is indicated below.

Classification of Data

Data is classified into

 Qualitative: It describes the quality of something or someone. It is descriptive information. For


example, the skin colour, eye colour, hair texture, etc. gives us the qualitative information about a
person.

 Quantitative: It provides numerical information. Example, the height and weight of a person.

Data Warehouse Definition :

A Data Warehouse (DW) is a relational database that is designed for query and analysis rather
than transaction processing. It includes historical data derived from transaction data from single
1
R-22

and multiple sources.

A Data Warehouse provides integrated, enterprise-wide, historical data and focuses on providing
support for decision-makers for data modeling and analysis.

A Data Warehouse is a group of data specific to the entire organization, not only to a particular group
of users.

It is not used for daily operations and transaction processing but used for making decisions

Subject-Oriented :

A data warehouse target on the modeling and analysis of data for decision-makers. Therefore,
data warehouses typically provide a concise and straightforward view around a particular
subject, such as customer, product, or sales, instead of the global organization's ongoing
operations. This is done by excluding data that are not useful concerning the subject and
including all data needed by the users to understand the subject.

Integrated :

A data warehouse integrates various heterogeneous data sources like RDBMS, flat files, and
online transaction records. It requires performing data cleaning and integration during data
warehousing to ensure consistency in naming conventions, attributes types, etc., among
different data sources

Time-Variant :

Historical information is kept in a data warehouse. For example, one can retrieve files from 3
2
R-22

months, 6 months, 12 months, or even previous data from a data warehouse. These
variations with a transactions system, where often only the most current file is kept

Non-Volatile:

The data warehouse is a physically separate data storage, which is transformed from the
source operational RDBMS. The operational updates of data do not occur in the data
warehouse, i.e., update, insert, and delete operations are not performed. It usually requires
only two procedures in data accessing: Initial loading of data and access to data.

Benefits of Data Warehouse :

1. Understand business trends and make better forecasting decisions.

2. Data Warehouses are designed to perform well enormous amounts of data.

3. The structure of data warehouses is more accessible for end-users to navigate,


understand, and query.

4. Queries that would be complex in many normalized databases could be easier to build
and maintain in data warehouses.

5. Data warehousing is an efficient method to manage demand for lots of information from
lots of users.

6. Data warehousing provide the capabilities to analyze a large amount of historical data.

Need for Data Warehouse :

3
R-22

1. Business User: Business users require a data warehouse to view summarized data from the
past. Since these people are non-technical, the data may be presented to them in an
elementary form.

2. Store historical data: Data Warehouse is required to store the time variable data from the
past. This input is made to be used for various purposes.

3. Make strategic decisions: Some strategies may be depending upon the data in the
data warehouse. So, data warehouse contributes to making strategic decisions.

4. For data consistency and quality: Bringing the data from different sources at a
commonplace, the user can effectively undertake to bring the uniformity and consistency in
data.

5. High response time: Data warehouse has to be ready for somewhat unexpected loads and
types of queries, which demands a significant degree of flexibility and quick response time.

Database Vs Datawarehouse

Feature Database Data Warehouse

Store and manage data for day-to- Analyze historical data for
Primary Function
day operations trends and insights

Optimized for fast retrieval and Optimized for complex


Data
modification (CRUD – Create, Read, queries and analysis (OLAP –
Structure
Update, Delete) Online Analytical Processing)

Primarily historical and


Data
Primarily current data integrated data from various
Currency
sources

4
R-22

Often denormalized to
Highly normalized to minimize
Schema improve query performance
redundancy
for analysis

Frequent updates as transactions Periodic updates (batch


Updates
occur processing)

Operational applications, individual Business analysts, data


Users
users scientists, executives

Focuses on data governance


Focuses on data integrity and access
Security and access control for
control for specific users
analytical purposes

More complex to design,


implement, and maintain
Complexity Simpler to design and manage
due to data integration and
transformation

Higher cost due to larger


Lower cost due to smaller size and
Cost storage requirements and
simpler infrastructure
processing power

Data Warehouse Modeling

Data warehouse modeling is the process of designing the schemas of the detailed and
summarized information of the data warehouse. The goal of data warehouse modeling is to
develop a schema describing the reality, or at least a part of the fact, which the data warehouse
is needed to support.

Data warehouse modeling is an essential stage of building a data warehouse for two main
5
R-22

reasons. Firstly, through the schema, data warehouse clients can visualize the relationships
among the warehouse data, to use them with greater ease. Secondly, a well-designed schema
allows an effective data warehouse structure to emerge, to help decrease the cost of
implementing the warehouse and improve the efficiency of using it.

Data modeling in data warehouses is different from data modeling in operational database
systems. The primary function of data warehouses is to support DSS processes. Thus, the
objective of data warehouse modeling is to make the data warehouse efficiently support
complex queries on long term information.

In contrast, data modeling in operational database systems targets efficiently supporting simple
transactions in the database such as retrieving, inserting, deleting, and changing data. Moreover,
data warehouses are designed for the customer with general information knowledge about the
enterprise, whereas operational database systems are more oriented toward use by software
specialists for creating distinct applications.

6
R-22

o Reflects the most current happenings, which are commonly the most stimulating.

o It is numerous as it is saved at the lowest method of the Granularity.

o It is always (almost) saved on disk storage, which is fast to access but expensive

and difficult to manage.

Older detail data is stored in some form of mass storage, and it is infrequently accessed
and kept at a level detail consistent with current detailed data.

Lightly summarized data is data extract from the low level of detail found at the
current, detailed level and usually is stored on disk storage. When building the data
warehouse have

to remember what unit of time is summarization done over and also the components or
what attributes the summarized data will contain.

Highly summarized data is compact and directly available and can even be found outside
the warehouse.

Difference between OLTP and OLAP

Below is the difference between OLAP and OLTP in Data Warehouse:

Para
meter OLTP OLAP
s

It is an online transactional
Proce OLAP is an online analysis
system. It manages database
ss and data retrieving process.
modification.

Chara It is characterized by large


It is characterized by a large
cterist numbers of short online
volume of data.
ic transactions.

7
R-22

Para
meter OLTP OLAP
s

Functi
OLTP is an online database OLAP is an online database query
onalit
modifying system. management system.
y

OLTP uses
Meth OLAP uses the data
traditional
od warehouse.
DBMS.

Insert, Update, and Delete Mostly select


Query
information from the database. operations

Tables in OLTP database Tables in OLAP database are


Table
are normalized. not normalized.

Sourc OLTP and its transactions are the Different OLTP databases become
e sources of data. the source of data for OLAP.

Data OLAP database does not get


OLTP database must maintain data
Integr frequently modified. Hence,
integrity constraint.
ity data integrity is not an issue.

Respons It’s response time is in Response time in


e time millisecond. seconds to minutes.

Data The data in the OLTP database


The data in OLAP process might
qualit is always detailed and
not be organized.
y organized.

8
R-22

Para
meter OLTP OLAP
s

Useful It helps to control and run It helps with planning, problem-


ness fundamental business tasks. solving, and decision support.

Opera Only read and rarely


Allow read/write operations.
tion write.

OLAP operations :

OLAP stands for Online Analytical Processing (OLAP) could be a innovation that’s utilized to
organize expansive business databases and back business intelligence. OLAP databases are
separated into one or more cubes, and each cube is organized and designed by a cube
administrator to fit the way simply recover and analyze data so that it is less demanding to
form and utilize the PivotTable reports and PivotChart reports that and just require.

OLAP operations:

There are five basic analytical operations that can be performed on an OLAP cube:

Drill down: In drill-down operation, the less detailed data is converted into highly detailed
data. It can be done by:
 Moving down in the concept hierarchy
 Adding a new dimension
In the cube given in overview section, the drill down operation is performed by moving down in
the concept hierarchy of Time dimension (Quarter -> Month).

9
R-22

Roll up: It is just opposite of the drill-down operation. It performs aggregation on the OLAP cube. It
can be done by:
 Climbing up in the concept hierarchy
 Reducing the dimensions
In the cube given in the overview section, the roll-up operation is performed by climbing up in the
concept hierarchy of Location dimension (City -> Country)

Dice: It selects a sub-cube from the OLAP cube by selecting two or more dimensions. In the cube
given in the overview section, a sub-cube is selected by selecting following dimensions with criteria:
 Location = “Delhi” or “Kolkata”
 Time = “Q1” or “Q2”
 Item = “Car” or “Bus”

10
R-22

Slice: It selects a single dimension from the OLAP cube which results in a new sub-cube creation. In
the cube given in the overview section, Slice is performed on the dimension Time = “Q1”

Pivot: It is also known as rotation operation as it rotates the current view to get a new view of the
representation. In the sub-cube obtained after the slice operation, performing pivot operation
gives a new view of it.

What is a data cube?


A data cube is a multidimensional representation of data intended to facilitate effortless retrieval and
11
R-22

structured analysis of the underlying data. When organized in a cube rather than a network of relational tables,
it becomes easier for the user to establish relationships between data that could otherwise be challenging to
figure out. This directly results in enhanced in-depth analysis and advanced drill down. Every face of the cube
can be programmed to represent a particular category and users can pivot the cube to look at the same data
from a unique perspective.

While data cubes can exist as a simple representation of data, without any extensive capabilities to analyze
large volumes, OLAP data cubes are particularly valuable for complex data analysis, including business
intelligence as they provide a comprehensive view of information across different dimensions, such as time,
products, locations, or customer segments. For example, if you are looking at a sales data cube, different
dimensions can show you data by year, product category, locations, customers, etc. So, whenever we mention a
data cube from here on, we will be referring to an OLAP model.
What are data cube classifications?
OLAP emerged as a response to the limitations of relational databases for analytical and multidimensional data
processing. OLAP databases are optimized for complex queries, multidimensional analysis, and fast retrieval of
processed data. They allow users to interact with data from different angles and hierarchies. OLAP has
developed into three major classifications:
 ROLAP: Relational OLAP stores and processes data in a relational database. ROLAP servers
typically query the relational database to generate reports and analyze data. Although scalable to large
volumes of data, ROLAP can encounter scalability issues after a certain point due to its dependency on
the underlying database. It integrates well with existing relational databases which makes it easier to
implement and maintain.
 MOLAP: Multidimensional OLAP stores and processes data in a multidimensional format
conceptualized like a cube. This structure makes it easier to establish relationships between data points
and optimize the data for complex analytical queries. MOLAP cubes are pre-calculated and stored in a
separate database from the source data. With many operations done beforehand, MOLAP does not
require as much processing power from the relational database server.
 HOLAP: Hybrid OLAP combines the best features of ROLAP and MOLAP. HOLAP stores some data
in a relational database and some data in the multidimension format. Therefore, HOLAP provides both
the scalability of ROLAP and the performance of MOLAP for organizations with large and complex data
sets.
12
R-22

What are the data cube operations?


Data cubes support various operations that allow users to examine and analyze data from different
perspectives. Here is an overview of some key data cube operations:
 Roll-up: This operation adds up all the data from a category and presents it as a singular record.
It is like zooming out of the cube and looking at the data from a broader perspective.
 Drill-down: While trying to access a transaction on the point-of-access, users need to descend
into a dimension hierarchy. For instance, drilling down on the product dimension in the sales data
cube would provide detailed sales figures for each product within each region.
 Slicing: When users want to focus on a specific set of facts from a particular dimension, they can
filter the data to focus on that subset. Slicing a sales data cube to focus on “Electronics” would restrict
the data to sales of electronic products only.
 Dicing: Breaking the data into multiple slices from a data cube can isolate a particular
combination of factors for analysis. By selecting a subset of values from each dimension, the user can
focus on the point where the two dimensions intersect each other. For example, dicing the product
dimension to “Electronics” and the region dimension to “Asia” would restrict the data to sales of
electronic products in the Asian region.
 Pivoting: Pivoting means rotating the cube to view the data from a unique perspective or
reorienting analysis to focus on a different aspect. Pivoting the sales data cube to swap the product and
region dimensions would shift the focus from sales by product to sales by region.
Data Cube Computation methods :

1) Materialization of Cube: Full, Iceberg, Closed and Shell Cubes

A three-dimensional data cube with dimensions A, B, and C and an aggregate measure M. And can
think of a data cube as a lattice of cuboids. Each cube is meant to symbolize a group. The fundamental
cuboid, encompassing all three dimensions, is the ABC.

The aggregate measure (M) is calculated for each permutation of the three dimensions. There are six
different cuboids that make up a data cube, with the base cuboid being the most specific. The apex
cuboid is the most generalized cuboid. It stores a single number - the sum of all the tuples' measures in
the base cuboid's measure M. From the topmost cuboid of the data cube example, we can descend
into the lattice to access deeper levels of information.

13
R-22

Base Cell In Data Cube

The term "base cell" refers to a cell in the base cuboid. Aggregate cells are cells that are not
based on a cube. Each dimension that is aggregated in an aggregate cell is represented by a ""
in the cell notation. Let's pretend we're working with an n-dimensional data cube.

Let each cell of the cuboids that make up the data cube be denoted by a = (a1, a2,..., an,
measurements). If there are m (m n) values of a, b, c, d, e, f, g, h, I j, k, l, m, n, and o that are
not "," then we say that an is an m-dimensional cell (that is, from an m- A is a base cell if and
only if m = n; otherwise, it is an aggregate cell (where m n).

Cuboids In Data Cube


As an intriguing compromise between storage requirements and response times for OLAP, data
cubes that are partially materialized are a viable option. We cannot compute the entire data cube,
but rather only parts of it, called cuboids, each of which is composed of a subset of the cells in the full
cube.

Minimum Support And Iceberg In Data Cube

Not only does this result in a more effective utilization of resources (namely, CPU time and disc
space), but it also makes it possible to conduct more accurate analysis. There is a good chance that
the non-passing cells are not important enough to warrant further investigation.

Cubes that only partially materialize are referred to as iceberg cubes, and this phrase is used to
characterize such cubes. The term "minimum support," also abbreviated as "min sup" for short,
describes the criteria that are the absolute minimum acceptable. It is common practise to refer to the
effect of materializing only a fraction of the cells in a data cube as the "tip of the iceberg." In this
14
R-22

context, "iceberg" refers to the entire cube including all cells.

2) Roll-up/Drill-down- This method involves aggregating data along one or more dimensions to create a
summary of the dataset. It can be used to drill-down into specific areas of interest within the data.
Roll-up/Drill-down is useful for quickly summarizing large datasets into manageable chunks while still
maintaining important information about each dimension.

For example, if and have sales data for multiple products across several regions, and could use roll-
up/drill-down to see total sales across all regions or drill-down into sales numbers for one particular
product in one region.

3) Slice-and-Dice - This method involves selecting subsets of data based on certain criteria and then
analyzing it using different dimensions. It is useful for identifying patterns that may not be
immediately apparent when looking at the entire dataset.

Slice-and-Dice allows users to select subsets of data based on specific criteria such as time period
or customer demographics which can then be analyzed using different dimensions like product
categories or geographic locations. This helps identify patterns that may not be immediately apparent
when looking at the entire dataset.

4) Grouping Sets - This method involves grouping data by multiple dimensions at once, allowing for
more complex analysis of the dataset. Grouping Sets are useful when analyzing large datasets with
multiple dimensions where users want to group by two or more dimensions at once. For example,
grouping sets could show total revenue broken down by both product category and region
simultaneously.

5) Online Analytical Processing (OLAP) - This method uses a multidimensional database to store and
analyze large amounts of data. It allows for quick querying and analysis of the data in different
ways.OLAP databases are specifically designed for analyzing large amounts of multi-dimensional data
quickly through pre-aggregated values stored in memory making it ideal for real-time decision-
making scenarios like stock market analysis.

6) SQL Queries - SQL queries can be used to compute data cubes by selecting specific columns and
aggregating them based on certain criteria. This is a flexible method that can be customized based on
the needs of the user.
15
R-22

***

16

You might also like