DATA WAREHOUSE UNIT-2
DATA WAREHOUSE
A data warehouse is a repository (data & metadata) that contains
integrated, cleansed, and reconciled data from disparate sources for
decision support applications, with an emphasis on online analytical
processing.
A data warehouse is a centralized system used for storing and
managing large volumes of data from various sources. It is designed
to help businesses analyze historical data and make informed
decisions. Data from different operational systems is collected,
cleaned, and stored in a structured way, enabling efficient querying
and reporting.
Typically, the data is multidimensional, historical, non-volatile. It's a
subject-oriented, integrated, time-variant, and non-volatile collection
of data designed to support management's decision-making process.
Data warehouses are typically used for business intelligence (BI),
reporting, and data mining.
Key Features of a Data Warehouse:
Subject-Oriented: Focused on specific business subjects or areas for
analysis.
Integrated: Data from multiple sources is combined into a single,
consistent view.
Time-Variant: Historical data is stored, allowing for trend analysis
over time.
Non-Volatile: Once data is loaded into the warehouse, it is not
typically updated or deleted.
Components of Data Warehouse
The main components of a data warehouse include:
Data Sources: These are the various operational
systems, databases, and external data feeds that provide raw data
to be stored in the warehouse.
ETL (Extract, Transform, Load) Process: The ETL process is
responsible for extracting data from different sources,
transforming it into a suitable format, and loading it into the data
warehouse.
Data Warehouse Database: This is the central repository where
cleaned and transformed data is stored. It is typically organized in
a multidimensional format for efficient querying and reporting.
Metadata: Metadata describes the structure, source, and usage
of data within the warehouse, making it easier for users and
systems to understand and work with the data.
Data Marts: These are smaller, more focused data repositories
derived from the data warehouse, designed to meet the needs of
specific business departments or functions.
OLAP (Online Analytical Processing) Tools: OLAP tools allow
users to analyze data in multiple dimensions, providing deeper
insights and supporting complex analytical queries.
End-User Access Tools: These are reporting and analysis tools,
such as dashboards or Business Intelligence (BI) tools, that enable
business users to query the data warehouse and generate reports.
Benefits of using a Data Warehouse:
Improved Decision-Making:
Data warehouses provide a comprehensive and reliable source of
information for decision-making.
Better Business Intelligence:
Data warehouses enable businesses to identify trends, patterns, and
insights that can lead to better business performance.
Simplified Reporting:
Data warehouses provide a centralized and consistent source of data
for generating reports and dashboards.
Data Integration:
They enable the integration of data from various sources, creating a
unified view of the business.
Types of Data Warehouses:
1. Enterprise Data Warehouse (EDW): Large-scale data
warehouses that support multiple business units.
2. Operational Data Store (ODS): Used to provide real-time data
for operational purposes.
3. Data Mart: A smaller, subject-oriented data warehouse focused on
a specific business area.
4. Cloud Data Warehouse: A data warehouse hosted in the cloud,
offering scalability and flexibility.
5. Big Data Warehouse: Designed to store vast amounts of
unstructured and structured data for big data analysis.
6. Virtual Data Warehouse: Provides access to data from multiple
sources without physically storing it.
7. Hybrid Data Warehouse: Combines on-premises and cloud-
based storage to offer flexibility.
8. Real-time Data Warehouse: Designed to handle real-time data
streaming and analysis for immediate insights.
Advantages of Data Warehousing
Intelligent Decision-Making: With centralized data in
warehouses, decisions may be made more quickly and
intelligently.
Business Intelligence: Provides strong operational insights
through business intelligence.
Data Quality: Guarantees data quality and consistency for
trustworthy reporting.
Scalability: Capable of managing massive data volumes and
expanding to meet changing requirements.
Effective Queries: Fast and effective data retrieval is made
possible by an optimized structure.
Disadvantages of Data Warehousing
Cost: Building a data warehouse can be expensive, requiring
significant investments in hardware, software, and personnel.
Complexity: Data warehousing can be complex, and businesses
may need to hire specialized personnel to manage the system.
Time-consuming: Building a data warehouse can take a
significant amount of time, requiring businesses to be patient and
committed to the process.
Data integration challenges: Data from different sources can
be challenging to integrate, requiring significant effort to ensure
consistency and accuracy.
Data security: Data warehousing can pose data security risks,
and businesses must take measures to protect sensitive data from
unauthorized access or breaches.
Multi-Dimensional Data Model
A Multidimensional Data Model is defined as a model that allows data to be
organized and viewed in multiple dimensions, such as product, time and location.
Facts are numerical measures, such as sales amount or quantity sold.
This model facilitates interactive analysis of data cubes, enabling
users to explore data from different perspectives.
A multidimensional model views data in the form of a data cube. A
data cube enables data to be modelled and viewed in multiple
dimensions. It is defined by dimensions and facts, where the facts are
numerical measures such as sales amount or quantity sold.
A multidimensional data model is organized around a central theme,
for example, sales. This theme is represented by a fact table. Facts
are numerical measures. The fact table contains the names of the
facts (measures), as well as keys to each of the related dimension tables.
Consider the data of a shop for items sold per quarter in the city of
Delhi. The data is shown in the table. In this 2D representation, the
sales for Delhi are shown for the time dimension (organized in
quarters) and the item dimension (classified according to the types of
an item sold). The fact or measure displayed is rupees sold (in
thousands).
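As a rough illustration only (a sketch using the pandas library, with made-up figures and column names), the same 2D view can be produced from a flat table of sales records:

import pandas as pd

# Hypothetical flat sales records for the shop (all figures invented)
sales = pd.DataFrame({
    "city":        ["Delhi"] * 4,
    "quarter":     ["Q1", "Q1", "Q2", "Q2"],
    "item":        ["Keyboard", "Mouse", "Keyboard", "Mouse"],
    "rupees_sold": [605, 825, 680, 952],   # in thousands
})

# 2D view for Delhi: time (quarters) down the rows, item types across the columns
view_2d = sales[sales["city"] == "Delhi"].pivot_table(
    index="quarter", columns="item", values="rupees_sold", aggfunc="sum"
)
print(view_2d)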
Key Concepts:
Data Cube:
A structure that represents the multidimensional data, allowing for
slicing, dicing, and drilling down/up.
Dimension Tables:
Tables that store the descriptive attributes of each dimension.
Fact Table:
A central table that stores the numerical measures (facts) and links
to the dimension tables.
Star Schema:
A common data warehouse schema where a central fact table is
linked to multiple dimension tables.
Snowflake Schema:
A more normalized version of the star schema, where dimension
tables are further broken down into smaller tables.
By organizing data in this way, data warehouses using
multidimensional models can provide users with a powerful and
flexible way to explore and analyze their data, leading to better
decision-making.
Data cube
A data cube allows data to be modelled and viewed in multiple
dimensions
• It is defined by dimensions and facts
• dimensions — perspectives or entities with respect to which an
organization wants to keep records
• facts — quantities by which we want to analyse relationships
between dimensions
• data cube is often referred to as a cuboid
Schemas for Multidimensional Data Models
1) Star schema
A star schema contains a central fact table combined with a set of
dimension tables; the fact table holds keys to all the connected
dimension tables. These tables are arranged in the form of a star. In
the following example, a star schema shows the sales of a company.
Sales represents the fact table (cuboid), and time, item, branch, and
location are the dimensions (attributes) of that cuboid.
Figure — 4 (Star schema)
• the large central table containing the bulk of the data is called the
fact table; it may hold some amount of redundancy
• a set of smaller attendant tables (dimension tables), one for each
dimension
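A minimal sketch of the star schema idea using pandas (the table and column names below are illustrative assumptions, not taken from the figure): the fact table holds the measures plus keys, and a query joins it to its dimension tables.

import pandas as pd

# Dimension tables: one small descriptive table per dimension
time_dim = pd.DataFrame({"time_key": [1, 2], "quarter": ["Q1", "Q2"], "year": [2024, 2024]})
item_dim = pd.DataFrame({"item_key": [10, 11], "item_name": ["Keyboard", "Mouse"]})

# Fact table: the large central table holding foreign keys plus numeric measures
sales_fact = pd.DataFrame({
    "time_key":    [1, 1, 2],
    "item_key":    [10, 11, 10],
    "rupees_sold": [605, 825, 680],
})

# Resolving the "star": join the fact table to each of its dimension tables
report = sales_fact.merge(time_dim, on="time_key").merge(item_dim, on="item_key")
print(report[["year", "quarter", "item_name", "rupees_sold"]])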
2) Snowflake schema
In the snowflake schema the dimensions are normalized. The
dimension tables are split into additional tables, so data redundancy
is greatly reduced.
Figure — 5 (snow flake schema)
• a variant of the star schema-model
• some dimension tables are normalized (splitting the data into
additional tables)
3) Fact Constellation schema
In a fact constellation schema there can be multiple fact tables. This
is also called a galaxy schema. It is similar to a star schema, but
multiple fact tables share the dimension tables.
Figure — 5 (Fact Constellation schema)
• multiple fact tables to share dimension tables
• can be viewed as a collection of stars, and hence is called a galaxy
schema or a fact Constellation
It becomes harder to visualize multiple dimensions, so cuboids are
represented as follows. Multiple dimensions are involved, and the
representation becomes quite complicated for a larger schema. This
is called a “lattice of cuboids”.
Figure — 6 (Lattice of cuboids)
Schema Definition
Multidimensional schema is defined using Data Mining Query
Language (DMQL). The two primitives, cube definition and dimension
definition, can be used for defining the data warehouses and data
marts.
Different operations can be applied to data cubes. Most common
operations are listed as follows.
OLAP Operations
Figure — 7 (OLAP Operations)
1. Roll-up (drill-up):
Figure — 7 (Drill — up OLAP Operation)
performs aggregation on a data cube, either by climbing up a
concept hierarchy for a dimension or by dimension reduction
• Drill-down:
Figure — 8 (Drill — down OLAP Operation)
can be realized by either stepping down a concept hierarchy
for a dimension or introducing additional dimensions
• Slice and dice:
Figure — 9 (Slice and Dice OLAP Operation)
The slice operation performs a selection on one dimension of
the given cube, resulting in a subcube, while the dice operation
defines a subcube by performing a selection on two or more
dimensions.
• Pivot (rotate):
Figure — 9 (Pivot OLAP Operation)
a visualization operation that rotates the data axes in view to
provide an alternative data presentation
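These operations can be imitated on a small table with pandas. This is only an analogy with invented data, not a real OLAP engine, but it shows what each operation does to the cube:

import pandas as pd

cube = pd.DataFrame({
    "year":    [2024] * 4 + [2025] * 4,
    "quarter": ["Q1", "Q1", "Q2", "Q2"] * 2,
    "city":    ["Delhi", "Mumbai"] * 4,
    "sales":   [605, 520, 680, 590, 640, 548, 701, 610],
})

# Roll-up: climb the time hierarchy (quarter -> year) and aggregate
rollup = cube.groupby("year")["sales"].sum()

# Drill-down: step back down the hierarchy to (year, quarter) detail
drilldown = cube.groupby(["year", "quarter"])["sales"].sum()

# Slice: a selection on one dimension gives a subcube
slice_2024 = cube[cube["year"] == 2024]

# Dice: a selection on two or more dimensions
dice = cube[(cube["year"] == 2024) & (cube["city"] == "Delhi")]

# Pivot (rotate): swap the axes of the view so cities become columns
pivoted = cube.pivot_table(index="quarter", columns="city", values="sales", aggfunc="sum")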
Data cleaning
Data cleaning, also known as data cleansing or data scrubbing, is the
process of identifying and correcting (or removing) errors,
inconsistencies, and inaccuracies within a dataset.
Data cleaning in a data warehouse involves identifying and correcting
errors, inconsistencies, and inaccuracies in the data before it is
loaded into the warehouse. This process ensures the accuracy and
reliability of the data, which is crucial for informed decision-
making. Data cleaning can be done manually or with the help of data
cleaning tools, which are more efficient and effective.
Data cleaning is the process of correcting or deleting inaccurate,
damaged, improperly formatted, duplicated, or insufficient data from
a dataset. Even if results and algorithms appear to be correct, they
are unreliable if the data is inaccurate. There are numerous ways for
data to be duplicated or incorrectly labelled when merging multiple
data sources.
Why is data cleaning important?
Ensures Data Quality:
Data cleaning helps maintain the accuracy and completeness of the
data, preventing the introduction of errors into the data warehouse.
Improves Data Reliability:
By correcting errors and inconsistencies, data cleaning ensures that
the data warehouse contains reliable and consistent information,
which is essential for accurate reporting and analysis.
Saves Time and Resources:
Cleaning the data before loading it into the warehouse saves time
and resources that would otherwise be spent correcting errors after
the data is in the warehouse.
Enhances Data Analysis:
Clean data enables more accurate and reliable data analysis, leading
to better insights and informed decision-making.
Common Data Cleaning Techniques:
1. Handling Missing Data
Missing data is a common problem in datasets. Strategies to handle
missing data include:
Removing Records: Deleting rows with missing values if they are
relatively few and insignificant.
Imputing Values: Replacing missing values with estimated ones,
such as the mean, median, or mode of the dataset.
Using Algorithms: Employing advanced techniques like
regression or machine learning models to predict and fill in
missing values.
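A small sketch of the first two strategies with pandas (the columns and values are hypothetical; the model-based approach is omitted here):

import pandas as pd

df = pd.DataFrame({"age": [25, None, 40, 35], "income": [30000, 42000, None, 55000]})

# Removing records: drop any row that contains a missing value
dropped = df.dropna()

# Imputing values: fill missing entries with the column mean or median
imputed = df.fillna({"age": df["age"].mean(), "income": df["income"].median()})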
2. Removing Duplicates
Duplicates can skew analyses and lead to inaccurate results.
Identifying and removing duplicate records ensures that each data
point is unique and accurately represented.
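For example, with pandas (assuming a hypothetical customer table):

import pandas as pd

df = pd.DataFrame({"customer_id": [1, 1, 2, 3],
                   "name": ["Asha", "Asha", "Ravi", "Meena"]})

# Drop rows that are exact duplicates, keeping the first occurrence
exact = df.drop_duplicates()

# Or treat rows as duplicates whenever the key column repeats
by_key = df.drop_duplicates(subset=["customer_id"], keep="first")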
3. Correcting Inaccuracies
Data entry errors, such as typos or incorrect values, need to be
identified and corrected. This can involve cross-referencing with
other data sources or using validation rules to ensure data accuracy.
4. Standardizing Formats
Data may be entered in various formats, making it difficult to
analyze. Standardizing formats, such as dates, addresses, and
phone numbers, ensures consistency and makes the data easier to
work with.
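A minimal sketch with pandas, assuming hypothetical date and phone-number columns:

import pandas as pd

df = pd.DataFrame({
    "order_date": ["05/01/2024", "17/02/2024", "03/03/2024"],   # day/month/year strings
    "phone":      ["080-2345 6789", "(080) 23456789", "08023456789"],
})

# Dates: parse the day-first strings into one consistent datetime type
df["order_date"] = pd.to_datetime(df["order_date"], format="%d/%m/%Y")

# Phone numbers: keep only the digits so every value has the same form
df["phone"] = df["phone"].str.replace(r"\D", "", regex=True)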
5. Dealing with Outliers
Outliers can distort analyses and lead to misleading results.
Identifying and addressing outliers, either by removing them or
transforming the data, helps maintain the integrity of the dataset.
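One common rule of thumb is the interquartile-range (IQR) filter; a sketch with pandas on invented values:

import pandas as pd

df = pd.DataFrame({"amount": [120, 135, 110, 150, 9999]})   # 9999 is an obvious outlier

# Keep only values inside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = df["amount"].quantile([0.25, 0.75])
iqr = q3 - q1
within_range = df["amount"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

cleaned = df[within_range]          # remove the outliers (or transform them instead)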
Identifying and Removing Duplicates:
Duplicate data can skew analysis and should be identified and
removed.
Standardizing Data Formats:
Ensuring that all data has a consistent format, such as date formats
or phone number formats, improves data consistency.
Handling Missing Values:
Deciding how to handle missing data, whether to fill it in with a
default value or remove the record, is an important step in data
cleaning.
Correcting Spelling Errors and Other Typos:
Typos and other errors can be corrected to ensure that the data is
accurate.
Filtering Outliers:
Outliers, or data points that fall significantly outside the expected
range, should be identified and addressed to ensure data accuracy.
Tools for Data Cleaning:
Data Cleaning Tools:
Various data cleaning tools can automate the process of identifying
and correcting errors in data.
Data Quality Software:
Dedicated data quality software can help with data cleansing,
matching, and de-duplication.
Open-Source Tools:
OpenRefine is a popular open-source tool for data cleaning and
transformation.
Tools and Techniques for Data Cleaning
Software Tools
Several software tools are available to aid in data cleaning. Some of the most
popular ones include:
Microsoft Excel: Offers basic data cleaning functions such as removing
duplicates, handling missing values, and standardizing formats.
OpenRefine: An open-source tool designed specifically for data cleaning and
transformation.
Python Libraries: Libraries like Pandas and NumPy provide powerful
functions for data cleaning and manipulation.
R: The R programming language offers robust packages for data cleaning,
such as dplyr and tidyr.
Techniques
Effective data cleaning also involves various techniques, such as:
Regular Expressions: Useful for pattern matching and text manipulation (see the sketch after this list).
Data Profiling: Involves examining data to understand its structure, content,
and quality.
Data Auditing: Systematically checking data for errors and inconsistencies.
Data Cleaning: Removing inconsistencies and errors from data.
Data Transformation: Converting data into a suitable format for
analysis.
Data Loading: Importing data into a central repository.
Data Matching: Identifying and merging duplicate records.
Data Consolidation: Aggregating data from different sources.
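As referenced above, a small sketch of pattern matching with Python's built-in re module (the e-mail data is invented):

import re

raw_values = ["  Asha <asha@EXAMPLE.com> ", "ravi@example.com", "not-an-email"]

# Pattern matching: pull a plausible e-mail address out of each raw string
pattern = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

cleaned = []
for value in raw_values:
    match = pattern.search(value)
    cleaned.append(match.group(0).lower() if match else None)

print(cleaned)   # ['asha@example.com', 'ravi@example.com', None]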
Data Integration
In data warehousing, data integration combines data from various
sources, while data transformation prepares it for analysis by
cleaning, formatting, and structuring it into a usable format.
Data integration is a critical process in data mining that involves
combining data from multiple sources to provide a unified view. This
process ensures that data is accurate, consistent, and usable for
analysis. Effective data integration can significantly enhance the
quality of insights derived from data mining activities.
There are two major approaches to data integration: the “tight
coupling” approach and the “loose coupling” approach.
Tight Coupling:
This approach involves creating a centralized repository or data
warehouse to store the integrated data. The data is extracted from
various sources, transformed and loaded into a data warehouse.
Data is integrated in a tightly coupled manner, meaning that the
data is integrated at a high level, such as at the level of the entire
dataset or schema. This approach is also known as data
warehousing, and it enables data consistency and integrity, but it
can be inflexible and difficult to change or update.
Here, a data warehouse is treated as an information retrieval
component.
In this coupling, data is combined from different sources into a
single physical location through the process of ETL – Extraction,
Transformation, and Loading.
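A minimal ETL sketch in Python with pandas; the file names and column names are assumptions for illustration only:

import pandas as pd

# Extract: pull raw data from two hypothetical source systems (CSV exports here)
orders_a = pd.read_csv("branch_a_orders.csv")          # assumed columns: order_id, amount
orders_b = pd.read_csv("branch_b_orders.csv")          # assumed columns: order_id, total

# Transform: align the column names before integrating
orders_b = orders_b.rename(columns={"total": "amount"})

# Load: store the combined result in the single central repository (the warehouse table)
warehouse_orders = pd.concat([orders_a, orders_b], ignore_index=True)
warehouse_orders.to_csv("warehouse_orders.csv", index=False)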
Loose Coupling:
This approach involves integrating data at the lowest level, such as
at the level of individual data elements or records. Data is integrated
in a loosely coupled manner, meaning that the data is integrated at a
low level, and it allows data to be integrated without having to
create a central repository or data warehouse. This approach is also
known as data federation, and it enables data flexibility and easy
updates, but it can be difficult to maintain consistency and integrity
across multiple data sources.
Here, an interface is provided that takes the query from the user,
transforms it in a way the source database can understand, and
then sends the query directly to the source databases to obtain
the result.
And the data only remains in the actual source databases.
Data transformation
Data transformation in data mining refers to the process of
converting raw data into a format that is suitable for analysis and
modelling. It also ensures that data is free of errors and
inconsistencies. The goal of data transformation is to prepare the
data for data mining so that it can be used to extract useful insights
and knowledge.
Data transformation in a data warehouse involves converting data
from one format or structure to another, primarily within the ETL
(Extract, Transform, Load) process. This process ensures data is
compatible with the target data warehouse, enhances data quality
and usability, and prepares data for analysis.
Data transformation is the process of converting, cleansing, and
structuring data into a usable format that can be analysed to support
decision making processes, and to propel the growth of an
organization.
Data transformation is used when data needs to be converted to
match that of the destination system. This can occur at two places in
the data pipeline. First, organizations with on-site data storage use an
extract, transform, load (ETL) pipeline, with the data transformation
taking place during the middle ‘transform’ step. Second, organizations
that load raw data directly into a cloud data warehouse often
transform it after loading (an extract, load, transform, or ELT, approach).
Here's a more detailed explanation:
Purpose of Data Transformation:
Data Integration:
Transforming data allows for the combination of data from multiple
sources into a single, unified view within the data warehouse.
Data Quality:
Transformation includes cleaning, validating, and preparing data to
ensure accuracy and consistency, making it suitable for analysis.
Data Compatibility:
Converting data formats and structures ensures it aligns with the
data warehouse's schema, facilitating efficient querying and
reporting.
Data Usability:
Transforming data makes it easier for users to understand and
interpret, enabling better data-driven decision-making.
Key Steps in Data Transformation:
Extraction: Raw data is extracted from various source systems.
Transformation: This is where data is converted into a consistent
and usable format.
Cleaning: Removing duplicate data, handling missing values, and
correcting inconsistencies.
Validating: Ensuring data adheres to predefined rules and
constraints.
Structuring: Converting data into a suitable format for the data
warehouse schema.
Data Type Conversion: Converting data types (e.g., string to
numeric) for compatibility.
Loading: The transformed data is loaded into the data warehouse.
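A compact sketch of the transform step with pandas (hypothetical columns and rules), covering cleaning, type conversion, validating, and structuring:

import pandas as pd

raw = pd.DataFrame({
    "customer_id": ["C1", "C1", "C2", None],
    "amount":      ["100", "100", "250", "-5"],
})

# Cleaning: remove duplicates and rows with a missing key
clean = raw.drop_duplicates().dropna(subset=["customer_id"]).copy()

# Data type conversion: string amounts become numbers
clean["amount"] = pd.to_numeric(clean["amount"])

# Validating: enforce a simple business rule (amounts must be positive)
clean = clean[clean["amount"] > 0]

# Structuring: rename and reorder columns to match the warehouse schema before loading
clean = clean.rename(columns={"customer_id": "cust_key"})[["cust_key", "amount"]]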
The data transformation process is carried out in five stages.
1. Discovery
The first step is to identify and understand the data in its original
source format with the help of data profiling tools, and to find all the
sources and data types that need to be transformed. This step helps
in understanding how the data needs to be transformed to fit into
the desired format.
2. Mapping
The transformation is planned during the data mapping phase. This
includes determining the current structure and the transformation
that is required, then mapping the data to understand, at a basic
level, how individual fields will be modified, joined, or aggregated.
3. Code generation
The code, which is required to run the transformation process, is
created in this step using a data transformation platform or tool.
4. Execution
The data is finally converted into the selected format with the help of
the code. The data is extracted from the source(s), which can vary
from structured to streaming, telemetry to log files. Next,
transformations are carried out on data, such as aggregation, format
conversion or merging, as planned in the mapping stage. The
transformed data is then sent to the destination system which could
be a dataset or a data warehouse.
Some of the transformation types, depending on the data involved,
include:
Filtering, which helps in selecting certain columns that require
transformation
Enriching, which fills out the basic gaps in the data set
Splitting, where a single column is split into multiple columns, or vice versa
Removal of duplicate data, and
Joining data from different sources
5. Review
The transformed data is evaluated to ensure the conversion has had
the desired results in terms of the format of the data.
It must also be noted that not all data will need transformation, at
times it can be used as is.
Benefits of Data Transformation:
Improved Data Quality: Ensures data is accurate, consistent, and
reliable.
Efficient Data Analysis: Facilitates easier and faster analysis of
data.
Enhanced Business Intelligence: Provides a single source of truth
for data, enabling better decision-making.
Cost Savings: Reduces the need for manual data reconciliation and
clean-up.
Data reduction
Data reduction means the reduction on certain aspects of data,
typically the volume of data.
Data reduction is the process of replacing a large amount of
measurements with a few characteristic numbers while preserving all
relevant information. This is often achieved by fitting a model through
the data points to describe the data more efficiently.
It uses multiple techniques, such as feature selection, feature
extraction, sampling, data compression, and data binning or
discretization, to remove noisy data and eradicate redundant data,
turning it into efficient, presentable data.
Data Reduction Techniques
1. Data Sampling: This technique involves selecting a subset of the
data to work with, rather than using the entire dataset. This can
be useful for reducing the size of a dataset while still preserving
the overall trends and patterns in the data.
2. Dimensionality Reduction: This technique involves reducing the
number of features in the dataset, either by removing features
that are not relevant or by combining multiple features into a
single feature.
3. Data Compression: This technique involves using techniques
such as lossy or lossless compression to reduce the size of a
dataset.
4. Data Discretization: This technique involves converting
continuous data into discrete data by partitioning the range of
possible values into intervals or bins.
5. Feature Selection: This technique involves selecting a subset of
features from the dataset that are most relevant to the task at
hand.
It is important to note that data reduction involves a trade-off
between accuracy and data size: the more the data is reduced, the
less accurate and less generalizable the resulting model may be.
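A brief sketch of two of the techniques above (sampling and a simple form of feature selection) using pandas and NumPy on synthetic data:

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "sales":    rng.normal(500, 50, 1000),
    "visits":   rng.normal(200, 20, 1000),
    "constant": 1.0,                        # a column that carries no information
})

# Data sampling: keep a 10% random subset that still reflects the overall patterns
sample = df.sample(frac=0.10, random_state=0)

# Feature selection: drop near-constant columns as a crude dimensionality reduction
variances = df.var()
reduced = df[variances[variances > 1e-6].index]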
Data Discretization
Discretization in data mining refers to converting a range of
continuous values into discrete categories. In simpler terms, it's like
grouping ages into categories like 'child,' 'teen,' 'adult', and 'senior'
instead of dealing with each age individually.
Data discretization refers to a method of converting a huge number of
data values into smaller ones so that the evaluation and management
of data become easy. In other words, data discretization is a method
of converting attributes values of continuous data into a finite set of
intervals with minimum data loss.
There are two forms of data discretization: supervised discretization
and unsupervised discretization. Supervised discretization refers to a
method in which the class information is used. Unsupervised
discretization does not use class information; it is characterized by
the direction in which the operation proceeds, using either a
top-down splitting strategy or a bottom-up merging strategy.
Purpose:
Discretization aims to transform continuous data into a smaller set
of discrete values, making it easier to analyse and interpret.
Process:
It involves dividing the range of a continuous variable into intervals
(bins) and assigning data points to these bins based on their values.
Famous techniques of data discretization
Histogram analysis
A histogram is a plot used to represent the underlying frequency
distribution of a continuous data set. It assists in inspecting the data
distribution, for example for outliers, skewness, or whether the
values follow a normal distribution.
Binning
Binning refers to a data smoothing technique that helps to group a
huge number of continuous values into a smaller number of bins.
This technique can also be used for data discretization and the
development of concept hierarchies.
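A short sketch of binning with pandas (the age values and category boundaries are invented):

import pandas as pd

ages = pd.Series([4, 15, 23, 37, 45, 62, 71, 88])

# Equal-width binning: split the age range into intervals of equal width
equal_width = pd.cut(ages, bins=4)

# Equal-frequency binning: each bin gets roughly the same number of values
equal_freq = pd.qcut(ages, q=4)

# Named bins, like the child/teen/adult/senior grouping mentioned earlier
labelled = pd.cut(ages, bins=[0, 12, 19, 59, 120],
                  labels=["child", "teen", "adult", "senior"])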
Cluster Analysis
Cluster analysis is a form of data discretization. A clustering algorithm
can be applied to partition the values of a numeric attribute into
clusters, and each cluster then forms one discrete interval of that
attribute.
Data discretization using correlation analysis
In this supervised procedure, neighbouring intervals that are most
similar according to a correlation-based measure are identified and
merged step by step, so that small intervals are combined into larger
ones until the final set of intervals is formed.
Benefits:
Data Reduction: Discretization reduces the number of unique
values, leading to a smaller dataset.
Improved Performance: Some data mining algorithms, like decision
trees and clustering, can perform better with discretized data.
Noise Reduction: Discretization can help reduce the impact of
outliers and noise in the data.
Enhanced Visualization: Discretized data can be easier to visualize
and understand.
Easier Comparison: It allows for easier comparison of results
between different datasets or models.