DMDW Notes Unit 1
A Data Warehouse consists of data from multiple heterogeneous data sources and is used for
analytical reporting and decision making. Data Warehouse is a central place where data is stored
from different data sources and applications.
A Data Warehouse is used for reporting and analysis of information and stores both historical
and current data. The data in a DW system is used for analytical reporting, which is later used by
business analysts, sales managers, or knowledge workers for decision-making.
Although a data warehouse and a traditional database share some similarities, they are not the
same thing. The main difference is that in a database, data is collected for transactional
purposes, whereas in a data warehouse, data is collected on an extensive scale to perform
analytics. Databases provide real-time data, while warehouses store data to be accessed for big
analytical queries.
ETL stands for Extract, Transform, Load and it is a process used in data warehousing to extract
data from various sources, transform it into a format suitable for loading into a data warehouse,
and then load it into the warehouse. The process of ETL can be broken down into the following
three stages:
Extraction:
The first step of the ETL process is extraction. In this step, data is extracted from various
source systems, which can be in different formats such as relational databases, NoSQL, XML,
and flat files, into the staging area. It is important to extract the data from the various source
systems and store it in the staging area first, rather than directly in the data warehouse, because
the extracted data is in various formats and can also be corrupted. Loading it directly into the
data warehouse could therefore damage it, and rollback would be much more difficult. This
makes extraction one of the most important steps of the ETL process.
Transformation:
The second step of the ETL process is transformation. In this step, a set of rules or
functions is applied to the extracted data to convert it into a single standard format. It
may involve tasks such as:
Cleaning – filling up the NULL values with some default values, mapping U.S.A, United
States, and America into USA, etc.
Loading:
The third and final step of the ETL process is loading. In this step, the transformed data is
finally loaded into the data warehouse. Sometimes the data is loaded into the data warehouse
very frequently, and sometimes it is loaded at longer but regular intervals. The rate and period
of loading depend solely on the requirements and vary from system to system.
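As a simple illustration of these three stages, the following minimal Python sketch outlines an ETL flow. The source file name, warehouse table, column names and cleaning rules are hypothetical and are used only for illustration.

import csv
import sqlite3

# Extract: read raw rows from a (hypothetical) source file into a staging list.
def extract(path):
    with open(path, newline="") as f:
        return list(csv.DictReader(f))  # the staging area is a plain list of dicts

# Transform: apply rules to bring the extracted rows into a single standard format.
COUNTRY_MAP = {"U.S.A": "USA", "United States": "USA", "America": "USA"}

def transform(rows):
    cleaned = []
    for row in rows:
        country = COUNTRY_MAP.get(row.get("country", ""), row.get("country", ""))
        amount = float(row["amount"]) if row.get("amount") else 0.0  # fill NULLs with a default
        cleaned.append((country, amount))
    return cleaned

# Load: write the transformed rows into the warehouse table.
def load(rows, db_path="warehouse.db"):
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS sales (country TEXT, amount REAL)")
    con.executemany("INSERT INTO sales (country, amount) VALUES (?, ?)", rows)
    con.commit()
    con.close()

if __name__ == "__main__":
    load(transform(extract("sales.csv")))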
The ETL process can also use the pipelining concept, i.e. as soon as some data is extracted, it
can be transformed, and during that period new data can be extracted. Likewise, while the
transformed data is being loaded into the data warehouse, the already extracted data can be
transformed.
TYPES OF DATA WAREHOUSE
The three main types of data warehouses are enterprise data warehouse (EDW), operational data
store (ODS), and data mart.
An enterprise data warehouse (EDW) is a centralized warehouse that provides decision support
services across the enterprise. EDWs are usually a collection of databases that offer a unified
approach for organizing data and classifying data according to subject.
An operational data store (ODS) is a central database used for operational reporting as a data
source for the enterprise data warehouse described above. An ODS is a complementary element
to an EDW and is used for operational reporting, controls, and decision making. An ODS is
refreshed in real-time, making it preferable for routine activities such as storing employee
records. An EDW, on the other hand, is used for tactical and strategic decision support.
A data mart is considered a subset of a data warehouse and is usually oriented to a specific team
or business line, such as finance or sales. It is subject-oriented, making specific data available to
a defined group of users more quickly, providing them with critical insights. The availability of
specific data ensures that they do not need to waste time searching through an entire data
warehouse.
Online Analytical Processing (OLAP)
OLAP stands for Online Analytical Processing. OLAP systems can analyze database
information from multiple systems at the same time. The primary goal of an OLAP service is
data analysis, not data processing.
OLTP stands for Online Transaction Processing. OLTP administers the day-to-day transactions
of an organization. The main goal of OLTP is data processing, not data analysis.
Online Analytical Processing (OLAP) consists of a type of software tool used for data analysis
to support business decisions. OLAP provides an environment to get insights from data
retrieved from multiple database systems at one time.
OLAP Examples
Any type of data warehouse system is an OLAP system. However, OLAP systems have some
notable characteristics and limitations, described below.
OLAP services require professionals to handle the data because of the complex modelling
procedure.
OLAP services are expensive to implement and maintain when datasets are large.
Analysis of data can be performed only after extraction and transformation of the data, which
delays the system.
OLAP services are not efficient for operational decision-making, as the data is updated only on
a periodic basis.
OLTP Examples
An example of an OLTP system is an ATM center: the person who authenticates first receives
the amount first, on the condition that the amount to be withdrawn is present in the ATM. Some
benefits of OLTP systems are described below.
OLTP services allow users to perform read, write and delete operations quickly.
OLTP services support large numbers of users and transactions, which helps in real-time
access to data.
OLTP services help to provide better security by applying multiple security features.
OLTP services help in better decision making by providing accurate and current data.
OLTP services provide data integrity, consistency, and high availability.
However, OLTP systems also have some limitations:
OLTP has limited analysis capability, as it is not intended for complex analysis or reporting.
OLTP has high maintenance costs because of frequent maintenance, backups, and recovery.
OLTP services are hampered whenever there is a hardware failure, which leads to the failure of
online transactions.
OLTP services often experience issues such as duplicate or inconsistent data.
DIFFERENCE BETWEEN OLAP AND OLTP
Definition: OLAP is well known as an online database query management system, whereas
OLTP is well known as an online database modifying system.
Application: OLAP is subject-oriented and is used for data mining, analytics, decision making,
etc., whereas OLTP is application-oriented and is used for business tasks.
Backup and Recovery: OLAP only needs backup from time to time as compared to OLTP,
whereas in OLTP the backup and recovery process is maintained rigorously.
The technique of frequent pattern mining is built upon a number of fundamental ideas. The
analysis is based on transaction databases, which include records or transactions that represent
collections of objects. Items inside these transactions are grouped together as itemsets.
One of the most popular methods, the Apriori algorithm, uses a step-by-step procedure to find
frequent itemsets. It starts by creating candidate itemsets of length 1, determining their support,
and eliminating any that fall below the predetermined cutoff. The method then repeatedly joins
the frequent itemsets from the previous phase to produce bigger itemsets. The procedure is
repeated until no more frequent itemsets can be found. The Apriori approach is commonly used
because of its efficiency and simplicity, but because it requires numerous database scans for big
datasets, it can be computationally inefficient.
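Before walking through the example, here is a minimal pure-Python sketch of the Apriori idea described above (join frequent (k-1)-itemsets, prune candidates with infrequent subsets, then count support). The function name and the sample transactions are hypothetical, for illustration only.

from itertools import combinations

def apriori(transactions, min_support=2):
    """Return all frequent itemsets (as frozensets) with their support counts."""
    transactions = [set(t) for t in transactions]

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t)

    # L1: frequent 1-itemsets
    items = {frozenset([i]) for t in transactions for i in t}
    current = {s: c for s in items if (c := support(s)) >= min_support}
    frequent = dict(current)
    k = 2
    while current:
        # Join step: combine frequent (k-1)-itemsets that share (k-2) items
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in current for s in combinations(c, k - 1))}
        # Count support of the surviving candidates against the dataset
        current = {c: s for c in candidates if (s := support(c)) >= min_support}
        frequent.update(current)
        k += 1
    return frequent

# Hypothetical transactions, for illustration only
T = [{"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"}, {"I1", "I2", "I4"}, {"I1", "I3"}]
print(apriori(T, min_support=2))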
Consider the following dataset and we will find frequent itemsets and generate association rules
for them.
Step-1:
K=1
(I) Create a table containing the support count of each item present in the dataset – called C1
(candidate set).
(II) Compare the candidate set items' support count with the minimum support count (here
min_support = 2); if the support_count of a candidate set item is less than min_support, remove
that item. This gives us itemset L1.
Step-2:
K=2
Generate candidate set C2 using L1 (this is called the join step). The condition for joining
Lk-1 and Lk-1 is that they should have (K-2) elements in common.
Check whether all subsets of an itemset are frequent or not, and if not, remove that itemset.
(For example, the subsets of {I1, I2} are {I1} and {I2}, which are frequent. Check this for
each itemset.)
Now find the support count of these itemsets by searching the dataset.
(II) Compare the candidate set (C2) support count with the minimum support count (here
min_support = 2); if the support_count of a candidate set item is less than min_support, remove
that item. This gives us itemset L2.
Step-3:
Generate candidate set C3 using L2 (join step). The condition for joining Lk-1 and Lk-1 is
that they should have (K-2) elements in common, so here, for L2, the first element should
match. The itemsets generated by joining L2 are {I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5},
{I2, I3, I4}, {I2, I4, I5} and {I2, I3, I5}.
Check whether all subsets of these itemsets are frequent or not, and if not, remove that
itemset. (Here the subsets of {I1, I2, I3} are {I1, I2}, {I2, I3} and {I1, I3}, which are
frequent. For {I2, I3, I4}, the subset {I3, I4} is not frequent, so remove it. Similarly check
every itemset.)
Find the support count of the remaining itemsets by searching the dataset.
(II) Compare the candidate set (C3) support count with the minimum support count (here
min_support = 2); if the support_count of a candidate set item is less than min_support, remove
that item. This gives us itemset L3.
Step-4:
Generate candidate set C4 using L3 (join step). The condition for joining Lk-1 and Lk-1 (K=4)
is that they should have (K-2) elements in common, so here, for L3, the first two elements
(items) should match.
Check whether all subsets of these itemsets are frequent or not. (Here the itemset formed by
joining L3 is {I1, I2, I3, I5}, and its subset {I1, I3, I5} is not frequent.) So there is no itemset
in C4.
Thus, we have discovered all the frequent itemsets. Now the generation of strong association
rules comes into the picture. For that we need to calculate the confidence of each rule.
Confidence – a confidence of 60% means that 60% of the customers who purchased milk and
bread also bought butter.
Confidence(A->B) = Support_count(A∪B) / Support_count(A)
So here, taking one frequent itemset as an example, we will show the rule generation. Consider
the itemset {I1, I2, I3} from L3. The candidate rules and their confidences are:
[I1^I2] => [I3], confidence = sup(I1^I2^I3)/sup(I1^I2) = 2/4*100 = 50%
[I1^I3] => [I2], confidence = sup(I1^I2^I3)/sup(I1^I3) = 2/4*100 = 50%
[I2^I3] => [I1], confidence = sup(I1^I2^I3)/sup(I2^I3) = 2/4*100 = 50%
[I1] => [I2^I3], confidence = sup(I1^I2^I3)/sup(I1) = 2/6*100 = 33%
[I2] => [I1^I3], confidence = sup(I1^I2^I3)/sup(I2) = 2/7*100 = 28%
[I3] => [I1^I2], confidence = sup(I1^I2^I3)/sup(I3) = 2/6*100 = 33%
So if the minimum confidence is 50%, the first three rules can be considered strong association
rules.
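The confidence arithmetic above can be verified with a few lines of Python. The support counts below are taken directly from the calculations above (sup(I1) = 6, sup(I2) = 7, sup(I3) = 6, sup(I1^I2) = sup(I1^I3) = sup(I2^I3) = 4, sup(I1^I2^I3) = 2); the rest is a sketch.

from itertools import combinations

# Support counts taken from the confidence calculations above
support = {
    frozenset({"I1"}): 6, frozenset({"I2"}): 7, frozenset({"I3"}): 6,
    frozenset({"I1", "I2"}): 4, frozenset({"I1", "I3"}): 4, frozenset({"I2", "I3"}): 4,
    frozenset({"I1", "I2", "I3"}): 2,
}

def confidence(antecedent, consequent):
    # Confidence(A -> B) = Support_count(A ∪ B) / Support_count(A)
    return support[frozenset(antecedent) | frozenset(consequent)] / support[frozenset(antecedent)]

min_conf = 0.5
itemset = {"I1", "I2", "I3"}
for r in range(1, len(itemset)):
    for antecedent in map(set, combinations(sorted(itemset), r)):
        consequent = itemset - antecedent
        conf = confidence(antecedent, consequent)
        status = "strong" if conf >= min_conf else "weak"
        print(f"{sorted(antecedent)} => {sorted(consequent)}: {conf:.0%} ({status})")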
Advantages of Apriori
The algorithm is easy to understand, and its join and prune steps are easy to implement on large
itemsets in large databases.
Disadvantages of Apriori
It requires a significant amount of computation if the itemsets are very large and the minimum
support is kept very low.
A full scan of the entire database is required.
FP-GROWTH ALGORITHM
A different strategy for frequent pattern mining is provided by the FP-growth algorithm. It
creates a compact data structure known as the FP-tree that effectively describes the dataset
without creating candidate itemsets. The FP-growth algorithm constructs the FP-tree recursively
and then directly mines frequent itemsets from it. FP-growth can be much quicker than Apriori
by skipping the construction of candidate itemsets, which lowers the number of passes over the
dataset. It is very helpful for sparse and huge datasets.
Let the minimum support be 3. A Frequent Pattern set is built which will contain all the elements
whose frequency is greater than or equal to the minimum support. These elements are stored in
descending order of their respective frequencies. After insertion of the relevant items, the set L
looks like this:-
L = {K : 5, E : 4, M : 3, O : 3, Y : 3}
Now, for each transaction, the respective Ordered-Item set is built. It is done by iterating the
Frequent Pattern set and checking if the current item is contained in the transaction in question.
If the current item is contained, the item is inserted in the Ordered-Item set for the current
transaction. The following table is built for all the transactions:
Now, all the Ordered-Item sets are inserted into a Trie Data Structure.
a) Inserting the set {K, E, M, O, Y}:
Here, all the items are simply linked one after the other in the order of occurrence in the set,
and the support count for each item is initialized as 1.
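A minimal sketch of this insertion step is given below. The node structure and function names are illustrative; only the ordered-item set {K, E, M, O, Y} from the text is used as input, and the remaining ordered-item sets would be inserted the same way, incrementing counts along shared prefixes.

class FPNode:
    # A node of the FP-tree (trie): an item, its support count, and its children
    def __init__(self, item=None):
        self.item = item
        self.count = 0
        self.children = {}  # item -> FPNode

def insert_ordered_itemset(root, ordered_items):
    # Link the items one after the other, incrementing the count along the path
    node = root
    for item in ordered_items:
        if item not in node.children:
            node.children[item] = FPNode(item)
        node = node.children[item]
        node.count += 1  # a newly created node starts at 1; shared prefixes accumulate

root = FPNode()  # the null root of the FP-tree
insert_ordered_itemset(root, ["K", "E", "M", "O", "Y"])  # ordered-item set (a) above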
Now for each item, the Conditional Frequent Pattern Tree is built. It is done by taking the
set of elements that is common in all the paths in the Conditional Pattern Base of that item and
calculating its support count by summing the support counts of all the paths in the Conditional
Pattern Base.
From the Conditional Frequent Pattern Tree, the Frequent Pattern rules are generated by
pairing the items of the Conditional Frequent Pattern Tree set with the corresponding item,
as given in the table below.
For each row, two types of association rules can be inferred; for example, for the first row, the
rules K -> Y and Y -> K can be inferred. To determine the valid rule, the confidence of both
rules is calculated, and the one with confidence greater than or equal to the minimum
confidence value is retained.
The FP Growth algorithm in data mining has several advantages over other frequent itemset
mining algorithms, as mentioned below:
Efficiency:
FP Growth algorithm is faster and more memory-efficient than other frequent itemset
mining algorithms such as Apriori, especially on large datasets with high dimensionality.
This is because it generates frequent itemsets by constructing the FP-Tree, which
compresses the database and requires only two scans.
Scalability:
FP Growth algorithm scales well with increasing database size and itemset
dimensionality, making it suitable for mining frequent itemsets in large datasets.
Resistant to noise:
FP Growth algorithm is more resistant to noise in the data than other frequent itemset
mining algorithms, as it generates only frequent itemsets and ignores infrequent itemsets
that may be caused by noise.
Parallelization:
FP Growth algorithm can be easily parallelized, making it suitable for distributed
computing environments and allowing it to take advantage of multi-core processors.
Disadvantages of FP Growth Algorithm
While the FP Growth algorithm in data mining has several advantages, it also has some
limitations and disadvantages, as mentioned below:
Memory consumption:
Although the FP Growth algorithm is more memory-efficient than other frequent itemset
mining algorithms, storing the FP-Tree and the conditional pattern bases can still require
a significant amount of memory, especially for large datasets.
Complex implementation:
The FP Growth algorithm is more complex than other frequent itemset mining
algorithms, making it more difficult to understand and implement.
Eclat Algorithm
Eclat, which stands for Equivalence Class Clustering and bottom-up Lattice Traversal, is a
popular frequent pattern mining method. It explores the itemset lattice using a depth-first
search approach, concentrating on a vertical data format representation. Transaction identifiers
(TIDs) are used effectively by Eclat to compute intersections between itemsets. This technique
is known for its ease of use and low memory requirements, making it appropriate for mining
frequent itemsets in vertical databases.
The ECLAT algorithm stands for Equivalence Class Clustering and bottom-up Lattice
Traversal. It is one of the popular methods of Association Rule mining. It is a more efficient and
scalable version of the Apriori algorithm. While the Apriori algorithm works in a horizontal
sense imitating the Breadth-First Search of a graph, the ECLAT algorithm works in a vertical
manner just like the Depth-First Search of a graph. This vertical approach of the ECLAT
algorithm makes it faster than the Apriori algorithm.
How does the algorithm work?
The basic idea is to use Transaction Id Set (tidset) intersections to compute the support value
of a candidate and to avoid the generation of subsets which do not exist in the prefix tree. In the
first call of the function, all single items are used along with their tidsets. Then the function is
called recursively and in each recursive call, each item-tidset pair is verified and combined with
other item-tidset pairs. This process is continued until no candidate item-tidset pairs can be
combined. Let us now understand the above working with an example. Consider the following
transaction record:
The above-given data is a boolean matrix where, for each cell (i, j), the value denotes whether
the j-th item is included in the i-th transaction or not; 1 means true and 0 means false. We now
call the function for the first time and arrange each item with its tidset in a tabular fashion
(k = 1, minimum support = 2). We then recursively call the function, combining item-tidset
pairs and intersecting their tidsets, for k = 2, k = 3 and k = 4.
We stop at k = 4 because there are no more item-tidset pairs to combine. Since minimum support
= 2, we conclude the following rules from the given dataset:-
Items Bought    Recommended Products
Bread           Butter
Bread           Milk
Bread           Jam
Butter          Milk
Butter          Coke
Butter          Jam
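The vertical, tidset-intersection recursion described above can be sketched in a few lines of Python. The transactions below are hypothetical (the boolean matrix from the notes is not reproduced here), so the printed itemsets are for illustration only.

def eclat(prefix, items, min_support, frequent):
    # Depth-first ECLAT: extend the current prefix by intersecting tidsets
    while items:
        item, tids = items.pop()
        if len(tids) >= min_support:
            frequent[tuple(sorted(prefix + [item]))] = len(tids)
            # Project the remaining items onto this prefix by intersecting tidsets
            suffix = [(other, tids & other_tids) for other, other_tids in items
                      if len(tids & other_tids) >= min_support]
            eclat(prefix + [item], sorted(suffix), min_support, frequent)

# Hypothetical transactions, for illustration only
transactions = [{"Bread", "Butter", "Milk"}, {"Bread", "Butter"},
                {"Bread", "Jam"}, {"Butter", "Coke"}, {"Bread", "Butter", "Jam"}]

# Vertical representation: item -> set of transaction ids (tidset)
tidsets = {}
for tid, t in enumerate(transactions):
    for item in t:
        tidsets.setdefault(item, set()).add(tid)

frequent = {}
eclat([], sorted(tidsets.items()), min_support=2, frequent=frequent)
print(frequent)  # frequent itemsets with their support counts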
Advantages of the ECLAT Algorithm over Apriori
Memory Requirements: Since the ECLAT algorithm uses a Depth-First Search approach, it
uses less memory than the Apriori algorithm.
Speed: The ECLAT algorithm is typically faster than the Apriori algorithm.
Number of Computations: The ECLAT algorithm does not involve repeated scanning of the
data to compute the individual support values.
Association analysis can provide valuable insights into consumer behavior and preferences. It
can help retailers identify the items that are frequently purchased together, which can be used to
optimize product placement and promotions. Similarly, it can help e-commerce websites
recommend related products to customers based on their purchase history.
Types of Associations
Here are the most common types of associations used in data mining:
Correlation Analysis is a data mining technique used to identify the degree to which two or
more variables are related or associated with each other. Correlation refers to the statistical
relationship between two or more variables, where the variation in one variable is associated
with the variation in another variable. In other words, it measures how changes in one variable
are related to changes in another variable. Correlation can be positive, negative, or zero,
depending on the direction and strength of the relationship between the variables.
For example, we are studying the relationship between the hours of study and the grades
obtained by students. If we find that as the number of hours of study increases, the grades
obtained also increase, then there is a positive correlation between the two variables. On the
other hand, if we find that as the number of hours of study increases, the grades obtained
decrease, then there is a negative correlation between the two variables. If there is no relationship
between the two variables, we would say that there is zero correlation.
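As a small illustration of the hours-of-study example, the Pearson correlation coefficient can be computed with Python's standard library (Python 3.10+); the numbers below are hypothetical.

from statistics import correlation  # Pearson correlation coefficient

hours_studied = [1, 2, 3, 4, 5, 6]        # hypothetical data
grades = [52, 58, 63, 70, 74, 81]         # hypothetical data

r = correlation(hours_studied, grades)
print(f"r = {r:.2f}")  # close to +1, i.e. a strong positive correlation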
After performing a correlation analysis, it is important to interpret the results to draw meaningful
conclusions about the relationship between the analyzed variables. One common way to interpret
correlation coefficients is by using the following general guidelines -
Any score from +0.5 to +1 indicates a strong positive correlation, meaning that the variables
are strongly related in a positive direction, i.e. they increase together.
Any score from -0.5 to -1 indicates a strong negative correlation, meaning that the variables
are strongly related in a negative direction, i.e. as one variable increases, the other decreases
and vice-versa.
A score of 0 indicates no correlation, meaning there is no relationship between the analyzed
variables.