
UNIT II DATA WAREHOUSE PROCESS AND ARCHITECTURE
Types of OLAP servers,
OLAP Servers
Online Analytical Processing (OLAP) refers to a set of software tools used for data analysis
in order to make business decisions. OLAP provides a platform for gaining insights from
data retrieved from multiple database systems at the same time. It is based on a
multidimensional data model, which enables users to extract and view data from various
perspectives. A multidimensional database is used to store OLAP data. Many Business
Intelligence (BI) applications rely on OLAP technology.

Type of OLAP servers:

The three major types of OLAP servers are as follows:

ROLAP

MOLAP

HOLAP

Relational OLAP (ROLAP):

Relational On-Line Analytical Processing (ROLAP) is primarily used for data stored in a
relational database, where both the base data and dimension tables are stored as relational
tables. ROLAP servers bridge the gap between the relational back-end server and the
client’s front-end tools. ROLAP servers store and manage warehouse data using an
RDBMS, while OLAP middleware supplies the multidimensional operations on top of it.
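As an illustration of the ROLAP idea, the following minimal Python sketch queries a tiny star-style layout in SQLite; the table names, columns, and figures are invented for the example and do not describe any particular product.

import sqlite3

# Hypothetical fact and dimension tables stored as ordinary relational tables.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE dim_store (store_id INTEGER PRIMARY KEY, region TEXT);
CREATE TABLE dim_time  (time_id  INTEGER PRIMARY KEY, year INTEGER, quarter TEXT);
CREATE TABLE fact_sales (store_id INTEGER, time_id INTEGER, amount REAL);
INSERT INTO dim_store VALUES (1, 'North'), (2, 'South');
INSERT INTO dim_time  VALUES (10, 2023, 'Q1'), (11, 2023, 'Q2');
INSERT INTO fact_sales VALUES (1, 10, 500.0), (1, 11, 700.0), (2, 10, 300.0);
""")

# A ROLAP server translates a multidimensional request ("sales by region and
# quarter") into joins and GROUP BY over the relational tables.
cur.execute("""
SELECT s.region, t.quarter, SUM(f.amount) AS total_sales
FROM fact_sales f
JOIN dim_store s ON f.store_id = s.store_id
JOIN dim_time  t ON f.time_id  = t.time_id
GROUP BY s.region, t.quarter
""")
for row in cur.fetchall():
    print(row)   # e.g. ('North', 'Q1', 500.0)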

Benefits:

It is compatible with data warehouses and OLTP systems.


The data size limitation of ROLAP technology is determined by the underlying RDBMS. As
a result, ROLAP does not limit the amount of data that can be stored.

Limitations:

SQL functionality is constrained.

It’s difficult to keep aggregate tables up to date.

Multidimensional OLAP (MOLAP):

Through array-based multidimensional storage engines, Multidimensional On-Line
Analytical Processing (MOLAP) supports multidimensional views of data. Storage utilization
in multidimensional data stores may be low if the data set is sparse.

MOLAP stores data on discs in the form of a specialized multidimensional array structure.
It is used for OLAP, which is based on the arrays’ random access capability. Dimension
instances determine array elements, and the data or measured value associated with each
cell is typically stored in the corresponding array element. The multidimensional array is
typically stored in MOLAP in a linear allocation based on nested traversal of the axes in
some predetermined order.
However, unlike ROLAP, which stores only records with non-zero facts, all array elements
are defined in MOLAP, and as a result, the arrays tend to be sparse, with empty elements
occupying a larger portion of them. MOLAP systems typically include provisions such as
advanced indexing and hashing to locate data while performing queries for handling sparse
arrays, because both storage and retrieval costs are important when evaluating online
performance. MOLAP cubes are ideal for slicing and dicing data and can perform complex
calculations. When the cube is created, all calculations are pre-generated.
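The following Python sketch illustrates the storage idea described above: cube cells addressed by dimension indices are laid out in a linear array by nested (row-major) traversal of the axes, and a dictionary stands in for the sparse indexing a real MOLAP engine would use. The dimension names and sizes are invented for the example.

# Hypothetical MOLAP-style storage sketch.
dim_sizes = {"product": 3, "region": 2, "month": 12}   # fixed axis order
sizes = list(dim_sizes.values())

def linear_offset(indices, sizes):
    """Map (i, j, k, ...) cell coordinates to a position in a 1-D array
    using row-major (nested) traversal of the axes."""
    offset = 0
    for index, size in zip(indices, sizes):
        offset = offset * size + index
    return offset

# Dense allocation: every cell gets a slot, even if its measure is zero.
cube = [0.0] * (sizes[0] * sizes[1] * sizes[2])
cube[linear_offset((1, 0, 5), sizes)] = 1250.0   # product 1, region 0, June

# Because real cubes are sparse, MOLAP engines add indexing/hashing;
# a dictionary keyed by coordinates is a toy stand-in for that idea.
sparse_cube = {(1, 0, 5): 1250.0}
print(cube[linear_offset((1, 0, 5), sizes)], sparse_cube.get((2, 1, 11), 0.0))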

Benefits:

Suitable for slicing and dicing operations.

Outperforms ROLAP when data is dense.

Capable of performing complex calculations.

Limitations:

It is difficult to change the dimensions without re-aggregating.

Since all calculations are performed when the cube is built, a large amount of data cannot
be stored in the cube itself.

Hybrid OLAP (HOLAP):


ROLAP and MOLAP are combined in Hybrid On-Line Analytical Processing (HOLAP).
HOLAP offers the greater scalability of ROLAP and the faster computation of MOLAP.
HOLAP servers are capable of storing large amounts of detailed data. On the one hand,
HOLAP benefits from ROLAP’s greater scalability; on the other hand, it makes use of cube
technology for faster performance and summary-type information. Because detailed data is
stored in a relational database, the cubes are smaller than in MOLAP.

Benefits:

HOLAP combines the benefits of MOLAP and ROLAP.

Provide quick access at all aggregation levels.

Limitations

Because it supports both MOLAP and ROLAP servers, HOLAP architecture is extremely
complex.

There is a greater likelihood of overlap, particularly in their functionalities.


3–Tier data warehouse architecture,
Data Warehouse Architecture

The Three-Tier Data Warehouse Architecture is the most commonly used design for
building a Data Warehouse. It specifies the required Data Warehouse Schema Model, the
required OLAP server type, and the required front-end tools for reporting or analysis
purposes. As the name suggests, it contains three tiers, the Bottom Tier, the Middle Tier
and the Top Tier, which are procedurally linked with one another from the Bottom Tier
(data sources) through the Middle Tier (OLAP servers) to the Top Tier (front-end tools).

Data Warehouse Architecture is the design based on which a Data Warehouse is built, to
accommodate the desired type of Data Warehouse Schema, user interface application and
database management system, for data organization and repository structure. The type of
Architecture is chosen based on the requirement provided by the project team. Three-tier
Data Warehouse Architecture is the commonly used choice, due to its detailing in the
structure. The three different tiers here are termed as:

Top-Tier

Middle-Tier

Bottom-Tier
Each Tier can have different components based on the prerequisites presented by the
decision-makers of the project, but they must fit the purpose of their respective tier.

Three-Tier Data Warehouse Architecture

Here is a pictorial representation for the Three-Tier Data Warehouse Architecture

1. Bottom Tier

The Bottom Tier in the three-tier architecture of a data warehouse consists of the Data
Repository. Data Repository is the storage space for the data extracted from various data
sources, which undergoes a series of activities as a part of the ETL process. ETL stands for
Extract, Transform and Load. As a preliminary process, before the data is loaded into the
repository, all the data relevant and required are identified from several sources of the
system. This data is then cleaned up to remove repeated or junk data from its current
storage units. The next step is to transform all this data into a single storage format.
The final step of ETL is to load the data into the repository. A few commonly used ETL tools
are:

Informatica

Microsoft SSIS

Snaplogic

Confluent

Apache Kafka

Alooma

Ab Initio

IBM Infosphere

The storage type of the repository can be a relational database management system or a
multidimensional database management system. A relational database system can hold
simple relational data, whereas a multidimensional database system can hold data that has
more than one dimension. Whenever the Repository includes both relational and
multidimensional database management systems, there exists a metadata unit. As the name
suggests, the metadata unit consists of all the metadata fetched from both the relational
database and multidimensional database systems. This Metadata unit provides incoming
data to the next tier, that is, the middle tier. From the user’s standpoint, the data from the
bottom tier can be accessed only with the use of SQL queries. The complexity of the
queries depends on the type of database. Data from the relational database system can be
retrieved using simple queries, whereas the multidimensional database system demands
complex queries with multiple joins and conditional statements.

2. Middle Tier

The Middle tier here is the tier with the OLAP servers. The Data Warehouse can have more
than one OLAP server, and it can have more than one type of OLAP server model as well,
which depends on the volume of the data to be processed and the type of data held in the
bottom tier. There are three types of OLAP server models, such as:

ROLAP

Relational online analytical processing is a model of online analytical processing which
carries out an active multidimensional breakdown of data stored in a relational database,
instead of redesigning a relational database into a multidimensional database.

This is applied when the repository consists of only the relational database system.

MOLAP

Multidimensional online analytical processing is another model of online analytical
processing that catalogs and indexes data directly in its multidimensional database system.

This is applied when the repository consists of only the multidimensional database system.

HOLAP

Hybrid online analytical processing is a hybrid of both relational and multidimensional
online analytical processing models.

When the repository contains both the relational database management system and the
multidimensional database management system, HOLAP is the best solution for a smooth
functional flow between the database systems. HOLAP allows storing data in both the
relational and the multidimensional formats.

The Middle Tier acts as an intermediary component between the top tier and the data
repository, that is, the top tier and the bottom tier respectively. From the user’s standpoint,
the middle tier gives an idea about the conceptual outlook of the database.

3. Top Tier

The Top Tier is a front-end layer, that is, the user interface that allows the user to connect
with the database systems. This user interface is usually a tool or an API call, which is used
to fetch the required data for Reporting, Analysis, and Data Mining purposes. The type of
tool depends purely on the form of outcome expected. It could be a Reporting tool, an
Analysis tool, a Query tool or a Data mining tool.

It is essential that the Top Tier be uncomplicated in terms of usability; only user-friendly
tools can give effective outcomes. Even when the bottom tier and middle tier are designed
with the utmost caution and clarity, if the Top Tier is fitted with a clumsy front-end tool,
the whole Data Warehouse Architecture can become an utter failure. This makes the
selection of the user interface/front-end tool for the Top Tier, which will serve as the face
of the Data Warehouse system, a very significant part of the Three-Tier Data Warehouse
Architecture design process.

Below are a few commonly used Top Tier tools.

IBM Cognos

Microsoft BI Platform

SAP Business Objects Web

Pentaho

Crystal Reports

SAP BW

SAS Business Intelligence

Distributed and virtual data warehouses.


A distributed data warehouse spreads the warehouse data across multiple physical
databases or sites rather than holding it in a single central repository, while still aiming to
present an integrated, enterprise-wide view of the data.

Distributed Data warehouse:

Most organizations build and maintain a single centralized data warehouse environment.
This setup makes sense for many reasons:

The data in the warehouse is integrated across the corporation, and an integrated view is
used only at headquarters.

The corporation operates on a centralized business model.

The volume of data in the data warehouse is such that a single centralized repository of
data makes sense.

Even if data could be integrated, if it were dispersed across multiple local sites, it would be
cumbersome to access.
In short, politics, economics, and technology greatly favor a single centralized data
warehouse. Still, in a few cases, a distributed data warehouse makes sense.

Virtual Data Warehouse

A virtual data warehouse presents a warehouse-like, unified view of business data and is
designed to simplify decision-making in business management. It collects and displays
business data relating to a specific moment in time, creating a snapshot of the condition of
the business at that moment. Virtual warehouses often draw on data from a wide variety of
sources.

A virtual warehouse is essentially a business database layer. The data it exposes is usually
drawn from multiple sources throughout a production system, so that related data can be
searched quickly without accessing the entire system. Performing a search of an entire
production system at one time could potentially compromise the system’s performance.
Using a warehouse layer removes this operating risk and speeds up the overall access
process.

Data Warehouse Implementation


There are various steps involved in implementing a data warehouse, which are as follows:

1. Requirements analysis and capacity planning: The first process in data warehousing
involves defining enterprise needs, defining architectures, carrying out capacity planning,
and selecting the hardware and software tools. This step involves consulting senior
management as well as the different stakeholders.

2. Hardware integration: Once the hardware and software have been selected, they need to
be put together by integrating the servers, the storage methods, and the user software
tools.

3. Modeling: Modeling is a significant stage that involves designing the warehouse schema
and views. This may involve using a modeling tool if the data warehouse is sophisticated.

4. Physical modeling: For the data warehouse to perform efficiently, physical modeling is
needed. This involves designing the physical data warehouse organization, data placement,
data partitioning, deciding on access techniques, and indexing.

5. Sources: The information for the data warehouse is likely to come from several data
sources. This step involves identifying and connecting the sources using a gateway, ODBC
drivers, or another wrapper.

6. ETL: The data from the source systems will need to go through an ETL phase. The
process of designing and implementing the ETL phase may involve selecting a suitable ETL
tool vendor and purchasing and implementing the tools. This may include customizing the
tool to suit the needs of the enterprise.

7. Populate the data warehouse: Once the ETL tools have been agreed upon, testing the
tools will be needed, perhaps using a staging area. Once everything is working adequately,
the ETL tools may be used in populating the warehouse given the schema and view
definitions.

8. User applications: For the data warehouse to be helpful, there must be end-user
applications. This step involves designing and implementing the applications required by
the end-users.

9. Roll-out the warehouse and applications: Once the data warehouse has been populated
and the end-client applications tested, the warehouse system and the applications may be
rolled out for the user community to use.
Data Warehousing – Tuning

A data warehouse keeps evolving and it is unpredictable what query the user is going to
post in the future. Therefore it becomes more difficult to tune a data warehouse system. In
this chapter, we will discuss how to tune the different aspects of a data warehouse such as
performance, data load, queries, etc.

Difficulties in Data Warehouse Tuning

Tuning a data warehouse is a difficult procedure due to the following reasons –

Data warehouse is dynamic; it never remains constant.

It is very difficult to predict what query the user is going to post in the future.

Business requirements change with time.

Users and their profiles keep changing.

The user can switch from one group to another.

The data load on the warehouse also changes with time.

Note – It is very important to have complete knowledge of the data warehouse.

Performance Assessment

Here is a list of objective measures of performance –


Average query response time

Scan rates

Time used per day

Memory usage per process

I/O throughput rates

Following are the points to remember.

It is necessary to specify the measures in service level agreement (SLA).

It is of no use trying to tune response times if they are already better than those required.

It is essential to have realistic expectations while making performance assessment.

It is also essential that the users have feasible expectations.

To hide the complexity of the system from the user, aggregations and views should be used.

It is also possible that the user can write a query you had not tuned for.

Data Load Tuning

Data load is a critical part of overnight processing. Nothing else can run until data load is
complete. This is the entry point into the system.

Note – If there is a delay in transferring the data, or in the arrival of data, then the entire
system is affected badly. Therefore it is very important to tune the data load first.
There are various approaches of tuning data load that are discussed below –

The most common approach is to insert data using the SQL layer. In this approach, normal
checks and constraints need to be performed. When the data is inserted into the table, the
code will run to check whether there is enough space to insert the data. If sufficient space
is not available, then more space may have to be allocated to these tables. These checks
take time to perform and are costly in CPU.

The second approach is to bypass all these checks and constraints and place the data
directly into the preformatted blocks. These blocks are later written to the database. It is
faster than the first approach, but it can work only with whole blocks of data. This can lead
to some space wastage.

The third approach is that, while loading the data into a table that already contains data,
we can maintain the indexes.

The fourth approach is that, to load data into tables that already contain data, we drop the
indexes and recreate them when the data load is complete. The choice between the third
and the fourth approach depends on how much data is already loaded and how many
indexes need to be rebuilt.
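As a rough illustration of the fourth approach, the Python/SQLite sketch below drops an index, bulk-loads a batch of rows in a single transaction, and recreates the index afterwards. The table, index, and data are hypothetical, and whether this is actually faster than maintaining the index depends on the volumes involved, as noted above.

import sqlite3

conn = sqlite3.connect("warehouse.db")
cur = conn.cursor()
cur.execute("CREATE TABLE IF NOT EXISTS fact_sales (store_id INT, amount REAL)")
cur.execute("CREATE INDEX IF NOT EXISTS idx_sales_store ON fact_sales(store_id)")

new_rows = [(1, 99.5), (2, 42.0), (1, 10.0)]   # stand-in for the nightly feed

# 1. Drop the index so each insert does not pay index-maintenance cost.
cur.execute("DROP INDEX IF EXISTS idx_sales_store")

# 2. Bulk-insert the data in one transaction (cheaper than row-by-row commits).
cur.executemany("INSERT INTO fact_sales VALUES (?, ?)", new_rows)

# 3. Recreate the index once, after the load is complete.
cur.execute("CREATE INDEX idx_sales_store ON fact_sales(store_id)")
conn.commit()
conn.close()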

Integrity Checks

Integrity checking highly affects the performance of the load. Following are the points to
remember –

Integrity checks need to be limited because they require heavy processing power.

Integrity checks should be applied on the source system to avoid degrading the performance
of the data load.
Tuning Queries

We have two kinds of queries in data warehouse –

Fixed queries

Ad hoc queries

Fixed Queries

Fixed queries are well defined. Following are the examples of fixed queries –

Regular reports

Canned queries

Common aggregations

Tuning the fixed queries in a data warehouse is the same as in a relational database system.
The only difference is that the amount of data to be queried may be different. It is good to
store the most successful execution plan while testing fixed queries. Storing these execution
plans will allow us to spot changes in data size and data skew, as they will cause the
execution plan to change.

Ad hoc Queries

To understand ad hoc queries, it is important to know the ad hoc users of the data
warehouse. For each user or group of users, you need to know the following –

The number of users in the group

Whether they use ad hoc queries at regular intervals of time

Whether they use ad hoc queries frequently

Whether they use ad hoc queries occasionally at unknown intervals.

The maximum size of query they tend to run

The average size of query they tend to run

Whether they require drill-down access to the base data


The elapsed login time per day

The peak time of daily usage

The number of queries they run per peak hour

Data Warehousing – Testing

Testing is very important for data warehouse systems to make them work correctly and
efficiently. There are three basic levels of testing performed on a data warehouse –

Unit testing

Integration testing

System testing

Unit Testing

In unit testing, each component is separately tested.

Each module, i.e., procedure, program, SQL script, or Unix shell script, is tested.

This test is performed by the developer.

Integration Testing

In integration testing, the various modules of the application are brought together and then
tested against a number of inputs.

It is performed to test whether the various components do well after integration.


System Testing

In system testing, the whole data warehouse application is tested together.

The purpose of system testing is to check whether the entire system works correctly
together or not.

System testing is performed by the testing team.

Since the size of the whole data warehouse is very large, it is usually possible to perform
minimal system testing before the test plan can be enacted.

Data Staging (ETL) Design And Development,

ETL stands for Extract, Transform, Load and it is a process used in data warehousing to
extract data from various sources, transform it into a format suitable for loading into a data
warehouse, and then load it into the warehouse. The process of ETL can be broken down
into the following three stages:

Extract: The first stage in the ETL process is to extract data from various sources such as
transactional systems, spreadsheets, and flat files. This step involves reading data from the
source systems and storing it in a staging area.

Transform: In this stage, the extracted data is transformed into a format that is suitable for
loading into the data warehouse. This may involve cleaning and validating the data,
converting data types, combining data from multiple sources, and creating new data fields.

Load: After the data is transformed, it is loaded into the data warehouse. This step involves
creating the physical data structures and loading the data into the warehouse.

The ETL process is an iterative process that is repeated as new data is added to the
warehouse. The process is important because it ensures that the data in the data warehouse
is accurate, complete, and up-to-date. It also helps to ensure that the data is in the format
required for data mining and reporting.
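A minimal Python sketch of the three stages is given below; the source file, column names, cleaning rules, and target table are hypothetical and stand in for a real staging area and warehouse.

import pandas as pd
import sqlite3

# Extract: read raw data from a source file into a staging DataFrame.
staging = pd.read_csv("sales_extract.csv")      # assumed columns: date, country, amount

# Transform: clean, standardize, and derive fields required by the warehouse.
staging["country"] = staging["country"].replace(
    {"U.S.A": "USA", "United States": "USA", "America": "USA"})
staging["amount"] = staging["amount"].fillna(0.0)
staging["load_date"] = pd.Timestamp.today().normalize()

# Load: write the conformed data into the warehouse table.
with sqlite3.connect("warehouse.db") as conn:
    staging.to_sql("fact_sales", conn, if_exists="append", index=False)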

Additionally, there are many different ETL tools and technologies available, such as
Informatica, Talend, DataStage, and others, that can automate and simplify the ETL
process.
ETL is a process in Data Warehousing and it stands for Extract, Transform and Load. It is a
process in which an ETL tool extracts the data from various data source systems,
transforms it in the staging area, and then finally, loads it into the Data Warehouse system.

Let us understand each step of the ETL process in-depth:

Extraction:

The first step of the ETL process is extraction. In this step, data from various source
systems, which can be in various formats such as relational databases, NoSQL, XML, and
flat files, is extracted into the staging area. It is important to extract the data from the
various source systems and store it in the staging area first, and not directly in the data
warehouse, because the extracted data is in various formats and can also be corrupted.
Loading it directly into the data warehouse may therefore damage it, and rollback will be
much more difficult. Hence, this is one of the most important steps of the ETL process.

Transformation:

The second step of the ETL process is transformation. In this step, a set of rules or
functions is applied to the extracted data to convert it into a single standard format. It
may involve the following processes/tasks:

Filtering – loading only certain attributes into the data warehouse.

Cleaning – filling up the NULL values with some default values, mapping U.S.A, United
States, and America into USA, etc.

Joining – joining multiple attributes into one.

Splitting – splitting a single attribute into multiple attributes.

Sorting – sorting tuples on the basis of some attribute (generally key-attribute).

Loading:
The third and final step of the ETL process is loading. In this step, the transformed data is
finally loaded into the data warehouse. Sometimes the data is updated by loading into the
data warehouse very frequently and sometimes it is done after longer but regular intervals.
The rate and period of loading depend solely on the requirements and vary from system
to system.

The ETL process can also use the pipelining concept, i.e., as soon as some data is extracted,
it can be transformed, and during that period some new data can be extracted. And while
the transformed data is being loaded into the data warehouse, the already extracted data
can be transformed.
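The toy Python sketch below mimics this pipelining: extraction, transformation, and loading run as concurrent threads connected by queues, so each stage can work on one batch while the previous stage produces the next. The batches and the simple normalization rule are invented for illustration.

import queue
import threading

extracted = queue.Queue(maxsize=10)
transformed = queue.Queue(maxsize=10)
SENTINEL = None

def extract():
    # Stand-in for reading successive batches from the source systems.
    for batch in (["u.s.a", " america"], ["united states"], ["india"]):
        extracted.put(batch)
    extracted.put(SENTINEL)

def transform():
    # Transform one batch while the extractor is already fetching the next.
    while (batch := extracted.get()) is not SENTINEL:
        transformed.put([value.strip().upper() for value in batch])
    transformed.put(SENTINEL)

def load():
    # Load one batch while earlier batches are still being transformed.
    while (batch := transformed.get()) is not SENTINEL:
        print("loading", batch)   # stand-in for an insert into the warehouse

threads = [threading.Thread(target=f) for f in (extract, transform, load)]
for t in threads:
    t.start()
for t in threads:
    t.join()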

ETL Tools: Most commonly used ETL tools are Hevo, Sybase, Oracle Warehouse builder,
CloverETL, and MarkLogic.

Data Warehouses: Most commonly used Data Warehouses are Snowflake, Redshift,
BigQuery, and Firebolt.
Designing an ETL (Extract, Transform, Load) process involves several key steps:

Define Requirements:

Clearly understand the business needs and data requirements. Identify data sources,
formats, and transformation rules.

Data Extraction:

Extract data from source systems using appropriate methods such as APIs, database
queries, or flat file imports.

Data Transformation:

Apply necessary transformations to the extracted data, such as cleaning, filtering, and
aggregating. Ensure data quality and consistency.

Data Loading:
Load the transformed data into the target system, which can be a data warehouse,
database, or other storage solutions.

Error Handling:

Implement mechanisms for handling errors during extraction, transformation, and loading.
Logging and alerting are crucial for monitoring and troubleshooting.

Scalability and Performance:

Design the ETL process to handle large volumes of data efficiently. Consider parallel
processing, batching, and optimization techniques for performance improvement.

Incremental Loading:

Implement strategies for incremental loading to update only the changed or new data,
reducing processing time and resource usage.
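A hypothetical Python/SQLite sketch of incremental loading with a high-water mark follows; the source table orders, the warehouse table fact_orders (assumed to already exist with a primary key on id), and the updated_at column are all assumptions made for the example.

import sqlite3

def incremental_load(source_conn, warehouse_conn):
    # Assumes fact_orders(id PRIMARY KEY, amount, updated_at) exists in the
    # warehouse and orders(id, amount, updated_at) exists in the source.
    wcur = warehouse_conn.cursor()
    wcur.execute("CREATE TABLE IF NOT EXISTS etl_watermark (last_loaded TEXT)")
    row = wcur.execute("SELECT MAX(last_loaded) FROM etl_watermark").fetchone()
    watermark = row[0] or "1970-01-01 00:00:00"

    # Pull only the rows changed since the previous run, not the full table.
    changed = source_conn.execute(
        "SELECT id, amount, updated_at FROM orders WHERE updated_at > ?",
        (watermark,)).fetchall()

    wcur.executemany(
        "INSERT OR REPLACE INTO fact_orders (id, amount, updated_at) "
        "VALUES (?, ?, ?)", changed)

    if changed:
        # Advance the high-water mark to the newest timestamp just loaded.
        wcur.execute("INSERT INTO etl_watermark VALUES (?)",
                     (max(r[2] for r in changed),))
    warehouse_conn.commit()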

Metadata Management:

Maintain metadata to document the ETL process. This includes data lineage,
transformation rules, and source-target mappings.

Security:

Implement security measures to protect sensitive data during extraction, transformation,
and loading. This involves access controls, encryption, and compliance with data privacy
regulations.

Testing:

Conduct thorough testing, including unit testing for individual components, integration
testing for the entire ETL process, and validation against business requirements.

Monitoring and Maintenance:


Set up monitoring tools to track the ETL process’s performance and health. Establish a
maintenance plan for periodic updates, bug fixes, and adapting to changing business needs.

Documentation:

Document the ETL design, configuration, and any specific considerations for future
reference or handover to other teams.

Data warehouse visualization,


Data warehouse visualization involves presenting complex data in a visually accessible
format, often using tools like Tableau, Power BI, or Looker to create meaningful insights
and aid decision-making. It helps users analyze large datasets and discover patterns or
trends.

Types of data warehouse visualization,

Dashboards: Provide a comprehensive overview of key performance indicators (KPIs) and
metrics, allowing users to monitor and analyze data at a glance.

Charts and Graphs: Include bar charts, line graphs, pie charts, and other visual
representations to showcase trends, comparisons, and distributions within the data.

Heatmaps: Display data values using color gradients, making it easy to identify patterns,
concentrations, or anomalies in large datasets.

Geospatial Visualizations: Present data on maps to reveal geographical trends, variations, or
correlations, commonly used for location-based analytics.
Treemaps: Visualize hierarchical data structures by representing categories and
subcategories in nested rectangles, aiding in the understanding of data relationships.

Pivot Tables: Allow users to manipulate and explore multidimensional data interactively,
facilitating dynamic analysis and reporting.

Word Clouds: Represent word frequency in textual data, with larger font sizes for more
frequently occurring terms, offering a quick overview of common themes.

Bubble Charts: Combine three dimensions of data by using bubbles of different sizes and
colors, conveying information about quantity, category, and another variable.

Time Series Visualizations: Display data changes over time, using line charts or other time-
focused visualizations to highlight trends, seasonality, or anomalies.

Cohort Analysis: Track and compare groups of users who share a common characteristic
over time, providing insights into user behavior and performance metrics.

Histograms: Illustrate the distribution of a single variable, showing the frequency of data
within different ranges or bins.

Scatter Plots: Visualize relationships between two variables, revealing correlations, clusters,
or outliers.
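As a small, hypothetical illustration of two of the visualization types listed above, the Python/matplotlib sketch below draws a bar chart of a KPI by region and a heatmap across region and month; the data is randomly generated for the example.

import matplotlib.pyplot as plt
import numpy as np

regions = ["North", "South", "East", "West"]
sales = [120, 95, 140, 80]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Bar chart: compare a KPI (total sales) across one dimension (region).
ax1.bar(regions, sales)
ax1.set_title("Sales by region")

# Heatmap: colour-encode a measure across two dimensions (region x month).
monthly = np.random.default_rng(0).integers(50, 150, size=(4, 12))
im = ax2.imshow(monthly, aspect="auto", cmap="viridis")
ax2.set_yticks(range(4))
ax2.set_yticklabels(regions)
ax2.set_title("Sales heatmap (region x month)")
fig.colorbar(im, ax=ax2)

plt.tight_layout()
plt.show()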

DEPLOYMENT, MAINTENANCE, GROWTH OF DATA WAREHOUSE
Deployment of data warehousing
Deploying a data warehouse involves setting up the infrastructure, loading data, and
ensuring efficient querying. Consider tools like Amazon Redshift, Snowflake, or Google
BigQuery based on your requirements. Design a robust ETL (Extract, Transform, Load)
process for data integration, and optimize for performance to support analytics. Regular
maintenance and monitoring are crucial for sustained efficiency.

Maintenance of data warehousing

Data warehousing maintenance includes tasks like monitoring performance, ensuring data
quality, and managing ETL processes. Regularly update and optimize indexes, review
storage usage, and address any issues promptly. Implement data backups and recovery
plans to safeguard against data loss. Stay current with software updates and periodically
reassess the data warehouse architecture to accommodate evolving business needs.
Regularly communicate with stakeholders to understand changing requirements and adapt
the data warehouse accordingly.

Growth of data warehousing

The growth of data warehousing involves scaling infrastructure, accommodating increased
data volumes, and adapting to evolving business needs. Consider vertical or horizontal
scaling of hardware, cloud resources, or distributed computing to handle larger datasets.
Implement efficient data partitioning and indexing strategies. Embrace technologies like
columnar storage and in-memory processing to enhance performance. Stay agile, regularly
evaluate new tools or features, and ensure that the data warehouse architecture can
seamlessly accommodate the growing demands of the organization.

The data warehouse development lifecycle is a framework that guides the development
process from inception through deployment and ongoing maintenance. It typically includes
the following phases:

Requirement analysis, Design (including modeling), Construction, Testing, Deployment,
Operation, Maintenance, and Retirement.

The typical lifecycle for data warehouse deployment project includes:

Project scoping and planning

Requirement

Front-end design

Warehouse schema design

OLTP to data warehouse mapping

Implementation

Deployment

Management and maintenance of the system

Data warehousing is a flow process that gathers and manages structured and unstructured
data from numerous sources into a centralized repository, supporting actionable business
decisions.
Requirement Specification: It is the first step in the development of the Data Warehouse
and is done by business analysts. In this step, Business Analysts prepare business
requirement specification documents. More than 50% of requirements are collected from
the client side and it takes 3-4 months to collect all the requirements. After the
requirements are gathered, the data modeler starts recognizing the dimensions, facts &
combinations based on the requirements. We can say that this is an overall blueprint of the
data warehouse. But, this phase is more about determining business needs and placing
them in the data warehouse.

Data Modelling: This is the second step in the development of the Data Warehouse. Data
Modelling is the process of visualizing data distribution and designing databases by
fulfilling the requirements to transform the data into a format that can be stored in the data
warehouse. For example, whenever we start building a house, we put all the things in the
correct position as specified in the blueprint. That’s what data modeling is for data
warehouses. Data modelling helps to organize data, creates connections between data sets,
and it’s useful for establishing data compliance and its security that line up with data
warehousing goals. It is the most complex phase of data warehouse development. And,
there are many data modelling techniques that businesses use for warehouse design. Data
modelling typically takes place at the data mart level and branches out in a data warehouse.
It’s the logic of how the data is stored concerning other data. There are three data models
for data warehouses:

Star Schema

Snowflake Schema
Galaxy Schema.
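A minimal sketch of a star schema, expressed as SQLite DDL run from Python, is shown below; the fact and dimension table names and columns are invented for illustration.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_date    (date_id INTEGER PRIMARY KEY, day TEXT, month TEXT, year INTEGER);
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_store   (store_id INTEGER PRIMARY KEY, city TEXT, region TEXT);

-- The central fact table holds the measures plus a foreign key to each dimension.
CREATE TABLE fact_sales (
    date_id    INTEGER REFERENCES dim_date(date_id),
    product_id INTEGER REFERENCES dim_product(product_id),
    store_id   INTEGER REFERENCES dim_store(store_id),
    quantity   INTEGER,
    amount     REAL
);
""")
# In a snowflake schema, a dimension such as dim_product would itself be
# normalized into further tables (e.g. a separate category table); a galaxy
# schema shares dimension tables across several fact tables.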

ETL Design and Development: This is the third step in the development of the Data
Warehouse. An ETL (Extract, Transform, Load) tool may extract data from various source
systems and store it in a data lake. An ETL process can then extract the data from the lake,
transform it, and load it into a data warehouse for reporting. For optimal speeds, good
visualization, and the ability to build easy, replicable, and consistent data pipelines between
all of the existing architecture and the new data warehouse, we need ETL tools. This is
where ETL tools like SAS Data Management, IBM Information Server, Hive, etc. come into
the picture. A good ETL process can be helpful in constructing a simple yet functional data
warehouse that’s valuable throughout every layer of the organization.

OLAP Cubes: This is the fourth step in the development of the Data Warehouse. An OLAP
cube, also known as a multidimensional cube or hypercube, is a data structure that allows
fast analysis of data according to the multiple dimensions that define a business problem. A
data warehouse would extract information from multiple data sources and formats like text
files, excel sheets, multimedia files, etc. The extracted data is cleaned and transformed and
is loaded into an OLAP server (or OLAP cube) where information is pre-processed in
advance for further analysis. Usually, data operations and analysis are performed using a
simple spreadsheet, where data values are arranged in row and column format. This is ideal
for two-dimensional data. However, OLAP contains multidimensional data, with data
typically obtained from different and unrelated sources. Employing a spreadsheet isn’t an
optimum choice. The cube will store and analyze multidimensional data in a logical and
orderly manner. Data warehouses are now offered as fully built products that are
configurable and capable of staging multiple types of data. OLAP cubes are becoming
outdated, as they cannot deliver the real-time analysis and reporting that businesses now
expect, along with high performance.
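As a toy stand-in for an OLAP cube, the pandas sketch below pre-aggregates a small fact set along two dimensions and then slices it on a third; the data and dimension names are hypothetical.

import pandas as pd

sales = pd.DataFrame({
    "region":  ["North", "North", "South", "South", "North"],
    "product": ["A", "B", "A", "B", "A"],
    "quarter": ["Q1", "Q1", "Q1", "Q2", "Q2"],
    "amount":  [100, 150, 80, 120, 90],
})

# Pre-aggregate the measure along two dimensions (one "face" of the cube).
cube_face = pd.pivot_table(sales, values="amount", index="region",
                           columns="quarter", aggfunc="sum", fill_value=0)
print(cube_face)

# "Slice": fix one dimension (product = A) and aggregate the remaining ones.
print(sales[sales["product"] == "A"]
      .pivot_table(values="amount", index="region", columns="quarter",
                   aggfunc="sum", fill_value=0))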

UI Development: This is the fifth step in the development of the Data Warehouse. So far,
the processes discussed have taken place at the backend. There is a need for a user
interface for how the user and a computer system interact, in particular the use of input
devices and software, to immediately access the data warehouse for analysis and generating
reports. The main aim of a UI is to enable a user to effectively manage a device or machine
they’re interacting with. There are plenty of tools in the market that help with UI
development. BI tools like Tableau, or Power BI for those using BigQuery, are great choices.

Maintenance: This is the sixth step in the development of the Data Warehouse. In this
phase, we can update or make changes to the schema and data warehouse’s application
domain or requirements. Data warehouse maintenance systems must also provide means to
keep track of schema modifications. At the schema level, we can perform operations such as
inserting or changing dimensions and categories; such changes include, for example, adding
or deleting user-defined attributes.

Test and Deployment: This is often the ultimate step in the Data Warehouse development
cycle. Businesses and organizations test data warehouses to ensure whether the required
business problems are implemented successfully or not. The warehouse testing involves the
scrutiny of enormous volumes of data. Data that has to be compared comes from
heterogeneous data sources like relational databases, flat files, operational data, etc. The
overall data warehouse project testing phases include: Data completeness, Data
Transformation, Data loading by means of ETL tools, Data integrity, etc. After testing, the
data warehouse is deployed so that users can immediately access the data and perform
analysis. Basically, in this phase, the data warehouse is turned on and lets the user
take the benefit of it. At the time of data warehouse deployment, most of its functions are
implemented. The data warehouses can be deployed at their own data center or on the
cloud.

Business Intelligence Overview


What is Business Intelligence?
Business Intelligence can be defined as a set of concepts and methodologies to improve
decision-making in business through the use of facts and fact-based systems. The goal of
Business Intelligence is to improve decision-making in business ideas and analysis. Business
Intelligence is not just a single concept; it is a group of concepts and methodologies.
Business Intelligence combines analytics with gut feeling for making decisions. Process
used in Business Intelligence: BI (Business Intelligence) uses a set of processes,
technologies, and tools (such as Informatica/IBM) to transform raw data into meaningful
information, and then transform that information into knowledge. Afterwards, beneficial
insights can be extracted, manually or by software, and the decision-makers can make
impactful decisions on the basis of those insights.

In short, Business Intelligence is about providing accurate information, in the right and
ethical format, to the decision-makers of an organization. Some important features of
Business Intelligence are:
Fact-based decision making.

360 degrees perspective on your business.

Virtual team members are on the same page.

Measurement for creating KPI (Key Performance Indicators) on the basis of historic data
fed into the system.

Identify the benchmark and then set the benchmarks for different processes.

Business Intelligence systems can be used to identify market trends and also to spot
business problems that need to be identified and solved.

Business Intelligence helps with data visualization, which increases the quality of data and,
in turn, the quality of decision making.

Business Intelligence systems can be used by large enterprises and organizations as well as
Small and Medium Enterprises, because they are quite affordable.

Types of Users of Business Intelligence:

Analyst (Data Analyst or Business Analyst): They are the statisticians of the company; they
use BI on the basis of historical data already stored in the system.

Head or Manager of the Company: The head of the company uses Business Intelligence to
increase the profitability of the company by making more efficient decisions on the basis of
all the knowledge discovered.

IT Engineer: For his company.

Small Business Owners: BI can also be used by small business owners because it is quite
affordable.

Government Officials: In the decision-making of the government.

Types of Decisions Supported by Business Intelligence:

Strategic Level: The strategic level is the level where the Heads of the company decide the
strategies of any business.

Tactical Level: Once the strategy is made, the details and matters are handled at the
tactical level, where all the technologies and methodologies come under one umbrella. This
level is further responsible for continuously updating the data.

Operational Level: Operation decisions are made at this level. Operational decisions help in
operating the system.
Applications of Business Intelligence:

In Decision Making of the company by decision-makers of the organizations.

In Data Mining while extracting knowledge.

In Operational Analytics and operational management.

In Predictive Analytics.

In Prescriptive Analytics.

Making Structured data from unstructured data.

In Decision Support System.

In Executive Information System (EIS).

Business Intelligence Tools and Software

Tableau: A business intelligence and data visualization application that enables users to
connect to different data sources, build interactive dashboards, and share findings with
others.

Microsoft Power BI: A cloud-based business intelligence program that enables users to
connect to a variety of data sources, produce visualizations, and communicate findings.

QlikView: A business intelligence and data visualization platform that enables users to
build interactive dashboards and examine data in novel ways.

SAP BusinessObjects: A complete business intelligence suite that includes data
visualization, reporting, and analytics tools.

IBM Cognos: A tool for performance management and corporate intelligence that enables
users to build reports, scorecards, and dashboards.

Oracle Business Intelligence: A full business intelligence suite that includes data
visualization, reporting, and analytics technologies.

MicroStrategy: A business intelligence and data visualization tool for creating dynamic
dashboards and reports.

SAS Business Intelligence: A full business intelligence suite that includes data visualization,
reporting, and analytics tools.

TIBCO Spotfire: A business intelligence and data visualization platform that enables users
to build interactive dashboards and investigate data in novel ways.
Looker: A tool for business intelligence and data visualization that enables users to build
interactive dashboards and investigate data in novel ways.

Data Warehousing and Business Intelligence Trends

Here are some business intelligence trends:

Artificial intelligence (AI)

AI is rapidly changing how businesses operate. It can automate routine tasks and provide
insights for strategic decision-making.

Data governance

Data governance is becoming more important as security and privacy regulations become
stricter. It’s necessary for companies that want to invest in business intelligence and get
meaningful results.

Natural language processing (NLP)

NLP is the computer-aided processing of natural language to facilitate communication
between humans and machines.

Advanced analytics

Advanced analytics is a critical component of modern business intelligence. It provides data
scientists with a platform they can depend on, and business users with tools to simplify
their workflows.

Data discovery

Data discovery techniques and visualization can empower businesses by producing relevant
insights for long-term decision-making.

Data quality management (DQM)

DQM was one of the hottest business intelligence trends in 2019. Businesses want to
implement data quality processes to enhance their ability to utilize business intelligence.

Data security

Data security is an important component of a Business Intelligence implementation. Row
Level Security makes sure users see the data they are supposed to see and not any extra.
Automation

Automation is a key component of any business intelligence tool. It allows you to create a
repeatable process that can be applied across your organization.

Other business intelligence trends include:

Analyzing customer behavior

Big data needs metadata

Data, a new commodity

A story of data to be told

Some data warehousing trends for 2023 include:

Predictive analytics: Using machine learning models and algorithms to analyze historical
data to predict demand, optimize inventory levels, and anticipate supply chain disruptions.

A recent study by Gitnux highlighted the following as among the most important data
warehouse trends:

Cloud-based data warehousing

Hybrid data warehousing

Real-time data processing

Self-service analytics

Artificial intelligence and machine learning integration

Data virtualization

Metadata management

Data warehouses are centralized storage systems that allow for the storing, analyzing, and
interpreting of data. They provide a comprehensive view of an organization’s data, enabling
valuable insights and data-driven decisions.

Business Intelligence vs. Data Warehousing

Business Intelligence: A set of tools and methods to analyze data and discover, extract and
formulate actionable information that would be useful for business decisions.
Data Warehousing: A system for the storage of data from various sources in an orderly
manner, so as to facilitate business-minded reads and writes.

Business Intelligence: It is a Decision Support System (DSS).
Data Warehousing: It is a data storage system.

Business Intelligence: Serves at the front end.
Data Warehousing: Serves at the back end.

Business Intelligence: The aim of business intelligence is to enable users to make informed,
data-driven decisions.
Data Warehousing: A data warehouse’s main aim is to provide the users of business
intelligence with a structured and comprehensive view of the available data of an
organization.

Business Intelligence: Collects data from the data warehouse for analysis.
Data Warehousing: Collects data from various disparate sources and organizes it for
efficient BI analysis.

Business Intelligence: Comprises business reports, charts, graphs, etc.
Data Warehousing: Comprises data held in “fact tables” and “dimensions” with business
meaning incorporated into them.

Business Intelligence: BI as such doesn’t have much use without a data warehouse, as large
amounts of varied and useful data are required for analysis.
Data Warehousing: BI is one of many use-cases for data warehouses; there are more
applications for this system.

Business Intelligence: Handled by executives and analysts relatively higher up in the
hierarchy.
Data Warehousing: Handled and maintained by data engineers and system administrators
who report to/work for the executives and analysts.

Business Intelligence: The role of Business Intelligence lies in improving the performance of
a business by utilizing tools and approaches that focus on counts, statistics, and
visualization.
Data Warehousing: The data warehouse reflects the actual database development and
integration process, along with data profiling and company validation standards.

Business Intelligence: It deals with OLAP (Online Analytical Processing), data visualization,
data mining, and query/reporting tools.
Data Warehousing: It deals with acquiring/gathering of data, metadata management,
cleaning of data, transforming data, data dissemination, and data recovery/backup planning.

Business Intelligence: Examples of BI software: SAP, Sisense, Datapine, Looker, etc.
Data Warehousing: Examples of data warehouse software: BigQuery, Snowflake, Amazon
Redshift, Panoply, etc.
SAS
SAS, which stands for Statistical Analysis System, is software that helps with data
management, statistical analysis, and reporting. SAS has been used by businesses and
organizations of all sizes for more than 40 years.

The SAS System provides a data warehouse model that addresses the entire scope of
warehouse management, organization, and exploitation. It is a comprehensive model,
supported by a broad array of products.

SAS/INSIGHT is a data mining tool. The SAS System has also become a pervasive tool for
managing and mining data.

A data warehouse is used to store large amounts of data that are collected from multiple
resources. Data is typically stored in a data warehouse through an extract, transform, and
load (ETL) process.

SAS, or Statistical Analysis System, is a software suite that facilitates data management,
advanced analytics, and business intelligence. It includes tools for statistical analysis, data
visualization, and reporting. SAS is widely used in fields such as finance, healthcare, and
research for handling, analyzing, and interpreting data to make informed decisions.

Tools of SAS

SAS offers a comprehensive suite of tools for various data-related tasks. Some key
components include:

Base SAS: Core functionality for data management, analytics, and reporting.

SAS/STAT: Module for statistical analysis, including regression, analysis of variance, and
multivariate analysis.

SAS/SQL: Provides a SQL interface for querying and manipulating data within SAS.

SAS/GRAPH: Enables data visualization through various charts, plots, and maps.
SAS/ETS: Focuses on time series analysis and econometrics.

SAS/IML: Interactive Matrix Language for matrix-based programming and advanced
analytics.

SAS Enterprise Guide: A point-and-click interface for data analysis and reporting.

SAS Enterprise Miner: Used for data mining and machine learning tasks.

SAS Forecast Studio: A tool for forecasting and predictive modeling.

SAS Visual Analytics: Supports interactive, exploratory data analysis and visual reporting.

These tools cater to different aspects of data handling and analysis, making SAS a versatile
platform for diverse applications.
