DW Lecture UNIT 2

This document covers ETL (Extraction, Transformation, Loading) and OLAP (Online Analytical Processing) technologies, detailing their processes, differences, and types of data warehouses. It explains the phases of ETL, including extraction, cleansing, transformation, and loading, while contrasting ETL with ELT (Extract, Load, Transform) methodologies. Additionally, it outlines various types of data warehouses, such as host-based and LAN-based, and their respective characteristics and limitations.


UNIT II-ETL AND OLAP TECHNOLOGY

What is ETL – ETL Vs ELT – Types of Data warehouses - Data warehouse Design and Modeling - Delivery
Process - Online Analytical Processing (OLAP) - Characteristics of OLAP - Online Transaction Processing
(OLTP) Vs OLAP - OLAP operations- Types of OLAP- ROLAP Vs MOLAP Vs HOLAP.

Lecture Topic 1: ETL

The mechanism of extracting information from source systems and bringing it into the data
warehouse is commonly called ETL, which stands for Extraction, Transformation and
Loading.

The ETL process is technically challenging and requires active inputs from various stakeholders, including developers, analysts, testers, and top executives.

To maintain its value as a tool for decision-makers, the data warehouse needs to change with business changes. ETL is a recurring process (daily, weekly, monthly) of a data warehouse
system and needs to be agile, automated, and well documented.

ETL consists of the following separate phases:

Extraction
o Extraction is the operation of extracting information from a source system for further use in a data
warehouse environment. This is the first stage of the ETL process.
o Extraction process is often one of the most time-consuming tasks in the ETL.
o The source systems might be complicated and poorly documented, and thus determining which
data needs to be extracted can be difficult.

o The data has to be extracted several times in a periodic manner to supply all changed data to the
warehouse and keep it up-to-date.
Cleansing
The cleansing stage is crucial in a data warehouse system because it is supposed to improve data
quality. The primary data cleansing features found in ETL tools are rectification and
homogenization. They use specific dictionaries to rectify typing mistakes and to recognize
synonyms, as well as rule-based cleansing to enforce domain-specific rules and define
appropriate associations between values.

The following examples show why data cleansing is essential:

If an enterprise wishes to contact its users or its suppliers, a complete, accurate and up-to-date list of
contact addresses, email addresses and telephone numbers must be available.

If a client or supplier calls, the responding staff should be able to find the person quickly in the
enterprise database, but this requires that the caller's name or his/her company name is listed in the
database.

If a user appears in the databases with two or more slightly different names or different account
numbers, it becomes difficult to update the customer's information.

Transformation
Transformation is the core of the reconciliation phase. It converts records from their operational source
format into a particular data warehouse format. If we implement a three-layer architecture, this
phase outputs our reconciled data layer.

The following points must be rectified in this phase:

o Loose texts may hide valuable information. For example, the string "XYZ PVT Ltd" does not explicitly show
that this is a limited company.
o Different formats can be used for individual data. For example, a date can be saved as a string or as
three integers.

Following are the main transformation processes aimed at populating the reconciled data layer:

o Conversion and normalization that operate on both storage formats and units of measure to make
data uniform.

o Matching that associates equivalent fields in different sources.


o Selection that reduces the number of source fields and records.

Cleansing and Transformation processes are often closely linked in ETL tools.
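To make these cleansing and transformation ideas concrete, here is a minimal, illustrative sketch in Python. It is not the API of any particular ETL tool; the synonym dictionary, the field names, and the fixed USD-to-INR rate are hypothetical stand-ins for rectification, homogenization, conversion and normalization, and selection.

```python
# A minimal, illustrative sketch (not any specific ETL tool's API): dictionary-based
# rectification, format/unit conversion, and field selection on one source record.
from datetime import datetime

# Hypothetical dictionary used to rectify typing mistakes and recognize synonyms.
CITY_SYNONYMS = {"Bombay": "Mumbai", "Banglore": "Bangalore", "Bengaluru": "Bangalore"}

def cleanse_and_transform(record: dict) -> dict:
    """Rectify, homogenize, and select fields from one operational record."""
    city = record.get("city", "").strip().title()
    city = CITY_SYNONYMS.get(city, city)                 # rectification / homogenization

    # Conversion and normalization: unify the date format and the unit of measure.
    order_date = datetime.strptime(record["order_date"], "%d/%m/%Y").date()
    rate = 83.0 if record.get("currency") == "USD" else 1.0   # hypothetical fixed rate
    amount_inr = float(record["amount"]) * rate

    # Selection: keep only the fields relevant to the reconciled data layer.
    return {"customer": record["customer"].strip().upper(),
            "city": city,
            "order_date": order_date.isoformat(),
            "amount_inr": round(amount_inr, 2)}

if __name__ == "__main__":
    raw = {"customer": " asha rao ", "city": "Banglore",
           "order_date": "05/03/2024", "amount": "120", "currency": "USD"}
    print(cleanse_and_transform(raw))
```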
Loading
The load is the process of writing the data into the target database. During the load step, it is necessary
to ensure that the load is performed correctly and with as few resources as possible.

Loading can be carried out in two ways (a short sketch follows):

1. Refresh: Data warehouse data is completely rewritten, which means that the older data is replaced.
Refresh is usually used in combination with static extraction to populate a data warehouse
initially.
2. Update: Only the changes applied to the source information are added to the data warehouse. An
update is typically carried out without deleting or modifying preexisting data. This method is
used in combination with incremental extraction to update data warehouses regularly.
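The following is a minimal sketch contrasting the two load strategies. The sales table and its columns are hypothetical, and sqlite3 merely stands in for the warehouse DBMS.

```python
# Refresh rewrites the table from a static extraction; update applies only changes
# from an incremental extraction without deleting preexisting data.
import sqlite3

def refresh_load(conn, rows):
    """Refresh: the warehouse table is completely rewritten."""
    conn.execute("DELETE FROM sales")                       # older data is replaced
    conn.executemany("INSERT INTO sales (sale_id, amount) VALUES (?, ?)", rows)
    conn.commit()

def update_load(conn, changed_rows):
    """Update: only changed rows are inserted or updated; other rows are kept."""
    conn.executemany(
        "INSERT INTO sales (sale_id, amount) VALUES (?, ?) "
        "ON CONFLICT(sale_id) DO UPDATE SET amount = excluded.amount",
        changed_rows)
    conn.commit()

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (sale_id INTEGER PRIMARY KEY, amount REAL)")
    refresh_load(conn, [(1, 100.0), (2, 250.0)])            # initial population
    update_load(conn, [(2, 275.0), (3, 80.0)])              # periodic incremental update
    print(conn.execute("SELECT * FROM sales ORDER BY sale_id").fetchall())
```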
Difference between ETL and ELT

ETL (Extract, Transform, and Load)

Extract, Transform, and Load is the technique of extracting records from sources (which may be located
outside the organization or on-premises) to a staging area, then transforming or reformatting them with
business rules applied in order to fit operational needs or data analysis, and finally
loading them into the target or destination database or data warehouse.

Strengths
Development Time: Designing from the output backwards ensures that only information applicable
to the solution is extracted and processed, potentially decreasing development, extraction, and
processing overhead.

Targeted data: Due to the targeted nature of the load process, the warehouse contains only
information relevant to the presentation. Reduced warehouse content simplifies the security
regime to be enforced and hence the administration overhead.

Tools Availability: The number of tools available that implement ETL provides flexibility of
approach and the opportunity to identify the most appropriate tool. The proliferation of tools has
led to a competitive functionality war, which often results in a loss of maintainability.

Weaknesses
Flexibility: Targeting only relevant information for output means that any future requirements that
need data which was not included in the original design will have to be added to the ETL
routines. Due to the tight dependencies between the routines developed, this often leads
to a need for fundamental redesign and development. As a result, it increases the time and cost
involved.

Hardware: Most third-party tools use their own engine to implement the ETL phase. Regardless of the
size of the solution, this can necessitate investment in additional hardware to
run the tool's ETL engine. The use of third-party tools to achieve the ETL process
also compels the learning of new scripting languages and processes.

Learning Curve: Implementing a third-party tool that uses foreign processes and languages
results in the learning curve that is implicit in any technology new to an organization, and can
often lead to blind alleys in its use due to a shortage of experience.

ELT (Extract, Load, and Transform)
ELT, which stands for Extract, Load, and Transform, is a different way of looking at data migration or
movement. ELT involves extracting information from the source system and
loading it into the target system, instead of transforming the data between the extraction and loading phases.
Once the data is copied or loaded into the target system, the transformation takes place.

The extract and load steps can be isolated from the transformation process. Isolating the load
phase from the transformation process removes an inherent dependency between these phases. In
addition to containing the data necessary for the transformations, the extract and load process can
include components of data that may be essential in the future. The load phase could take the
entire source and load it into the warehouse.

Separating the phases enables the project to be broken down into smaller chunks, thus making
it more specific and manageable.

Performing the data integrity analysis in the staging area enables a further phase in the process
to be isolated and dealt with at the most appropriate point in the process. This also helps
to ensure that only cleaned and checked information is loaded into the warehouse for
transformation.

Isolating the transformations from the load steps helps to encourage a more staged approach to
warehouse design and implementation.
Strengths
Project Management: Being able to divide the warehouse process into specific and isolated
functions enables a project to be designed on a smaller, function-by-function basis, so the project can
be broken down into feasible chunks.

Flexible & Future Proof: In general, in an ELT implementation, all records from the sources are
loaded into the data warehouse as part of the extract and load process. This, linked with the
isolation of the transformation phase, means that future requirements can easily be incorporated
into the data warehouse architecture.

Risk minimization: Removing the close interdependencies between each step of the
warehouse build process enables the development of each step to be isolated, and the individual
process designs can thus also be separated. This provides a good platform for change,
maintenance and management.

Utilize Existing Hardware: In implementing ELT as a warehouse build process, the essential tools
provided with the database engine can be used.

Utilize Existing Skill Sets: By using the functionality supported by the database engine, the existing
investment in database skills is re-used to develop the warehouse. No new skills need to be
learned, and the full weight of the experience in developing with the engine's technology is utilized,
further reducing the cost and risk in the development process.

Weaknesses
Against the Norm: ELT is a newer approach to data warehouse design and development. While it
has proven itself many times over through its abundant use in implementations throughout the
world, it does require a change in mentality and design approach compared with traditional methods.

Tools Availability: Being an emergent technology approach, ELT suffers from the limited
availability of tools.

Difference between ETL and ELT

| Basis | ETL | ELT |
| --- | --- | --- |
| Process | Data is transferred to the ETL server and moved back to the database. High network bandwidth is required. | Data remains in the database except for cross-database loads (e.g., source to target). |
| Transformation | Transformations are performed in the ETL server. | Transformations are performed in the target (or in the source). |
| Code Usage | Typically used for source-to-target transfers, compute-intensive transformations, and small amounts of data. | Typically used for high amounts of data. |
| Time / Maintenance | Needs high maintenance, as you need to select the data to load and transform. | Low maintenance, as data is always available. |
| Calculations | Overwrites an existing column or needs to append the dataset and push it to the target platform. | Easily adds the calculated column to the existing table. |

Lecture Topic 2: Types of Data Warehouses

There are different types of data warehouses, which are as follows:

Host-Based Data Warehouses


There are two types of host-based data warehouses which can be implemented:

o Host-based mainframe warehouses, which reside on high-volume databases supported by robust
and reliable high-capacity platforms such as IBM System/390, UNISYS and Data General Sequent
systems, and by databases such as Sybase, Oracle, Informix, and DB2.
o Host-based LAN data warehouses, where data delivery can be handled either centrally or from
the workgroup environment. The size of the data warehouse database depends on the
platform.
Data extraction and transformation tools allow the automated extraction and cleaning of data
from production systems. It is not advisable to enable direct access by query tools to these
categories of systems, for the following reasons:

1. A huge load of complex warehousing queries would probably have too harmful an impact
upon the mission-critical transaction processing (TP)-oriented applications.
2. These TP systems have evolved in their database design for transaction throughput. In
all cases, a database is designed for either optimal query or transaction processing. A complex
business query requires the joining of many normalized tables, and as a result performance will
usually be poor and the query constructs highly complex.
3. There is no assurance that data in two or more production systems will be consistent.

Host-Based (MVS) Data Warehouses


Data warehouses that reside on large-volume databases on MVS are the host-based
type of data warehouse. Often the DBMS is DB2, with a huge variety of original sources for
legacy information, including VSAM, DB2, flat files, and Information Management System
(IMS).

Before embarking on designing, building and implementing such a warehouse, some further
considerations must be given, because:

1. Such databases generally have very high volumes of data storage.
2. Such warehouses may require support for both MVS and customer-based report and query
facilities.
3. These warehouses have complicated source systems.
4. Such systems require continuous maintenance, since they must also be used for mission-critical
objectives.

To make the building of such data warehouses successful, the following phases are generally followed:

1. Unload Phase: It involves selecting and scrubbing the operational data.
2. Transform Phase: For translating the data into an appropriate form and describing the rules for
accessing and storing it.
3. Load Phase: For moving the records directly into DB2 tables or into a particular file for moving them into
another database or non-MVS warehouse.

An integrated metadata repository is central to any data warehouse environment. Such a facility
is required for documenting data sources, data translation rules, and user access to the warehouse.
It provides a dynamic network between the multiple data source databases and the DB2 of the
conditional data warehouses.
A metadata repository is necessary to design, build, and maintain data warehouse processes. It
should be capable of describing what data exists in both the operational system and the data
warehouse, where the data is located, how the operational data maps to the warehouse
fields, and the end-user access techniques. Query, reporting, and maintenance facilities are another
indispensable part of such a data warehouse, for example an MVS-based query and reporting tool for
DB2.

Host-Based (UNIX) Data Warehouses
Oracle and Informix RDBMSs support the facilities for such data warehouses. Both of these
databases can extract information from MVS-based databases as well as from a large number of
other UNIX-based databases. These types of warehouses follow the same stages as the host-based
MVS data warehouses. Data from different network servers can also be used, since
file attribute consistency is common across the inter-network.

LAN-Based Workgroup Data Warehouses


A LAN-based workgroup warehouse is an integrated structure for building and maintaining a
data warehouse in a LAN environment. In this warehouse, we can extract information from a
variety of sources and support multiple LAN-based warehouses; commonly chosen warehouse
databases include the DB2 family, Oracle, Sybase, and Informix. Other databases that can also be
included, although less frequently, are IMS, VSAM, flat files, MVS, and VM.

Designed for the workgroup environment, a LAN-based workgroup warehouse is optimal for
any business organization that wants to build a data warehouse, often called a data mart. This
type of data warehouse generally requires a minimal initial investment and little technical training.

Data Delivery: With a LAN-based workgroup warehouse, customers need minimal technical
knowledge to create and maintain a store of data that is customized for use at the department,
business unit, or workgroup level. A LAN-based workgroup warehouse ensures the delivery of
information from corporate resources by providing transparent access to the data in the warehouse.

Host-Based Single-Stage (LAN) Data Warehouses


Within a LAN-based data warehouse, data delivery can be handled either centrally or from the
workgroup environment, so business groups can process their data needs without
burdening centralized IT resources, enjoying the autonomy of their own data mart without compromising
overall data integrity and security in the enterprise.

Limitations
Both DBMS and hardware scalability issues generally limit LAN-based warehousing
solutions.

Many LAN-based enterprises have not implemented adequate job scheduling, recovery
management, organized maintenance, and performance monitoring methods to provide robust
warehousing solutions.

Often these warehouses are dependent on other platforms for source records. Building an
environment that has data integrity, recoverability, and security requires careful design, planning,
and implementation. Otherwise, synchronization of transformations and loads from sources to the
server could cause innumerable problems.
o A LAN-based warehouse provides data from many sources while requiring a minimal initial
investment and technical knowledge. A LAN-based warehouse can also use replication
tools for populating and updating the data warehouse. This type of warehouse can include
business views, histories, aggregation, versioning, and heterogeneous source support,
such as the DB2 family.
o IMS, VSAM, Flat File [MVS and VM]

A single store frequently drives a LAN-based warehouse and serves existing DSS applications,
enabling the business user to locate data in the data warehouse. The LAN-based warehouse can
support business users with a complete data-to-information solution. The LAN-based warehouse
can also share metadata, with the ability to catalog business data and make it available to anyone
who needs it.

Multi-Stage Data Warehouses
This refers to multiple stages of transformation for analyzing data through aggregations. In
other words, the data is staged multiple times before the loading operation into the data
warehouse: data gets extracted from the source systems to a staging area first, then gets loaded to the data
warehouse after transformation, and then finally to departmentalized data marts.

This configuration is well suited to environments where end clients in numerous capacities
require access both to up-to-the-minute information for tactical decisions and to
summarized, cumulative records for long-term strategic decisions. Both the Operational Data
Store (ODS) and the data warehouse may reside on host-based or LAN-based databases,
depending on volume and custom requirements. These include DB2, Oracle, Informix, IMS, flat
files, and Sybase.

Usually, the ODS stores only the most up-to-date records, while the data warehouse stores the
historical accumulation of the records. At first, the information in both databases will be very similar.
For example, the records for a new client will look the same.
As changes to the user record occur, the ODS will be refreshed to reflect only the most current
data, whereas the data warehouse will contain both the historical data and the new information.
Thus the volume requirement of the data warehouse will exceed the volume requirement of the
ODS over time. It is not unusual to reach a ratio of 4 to 1 in practice.

Stationary Data Warehouses


In this type of data warehouse, the data is not moved from the sources. Instead, the customer is
given direct access to the data. For many organizations, infrequent access, volume issues, or
corporate necessities dictate such an approach. This approach does generate several problems for
the customer, such as:

o Identifying the location of the information for the users


o Providing clients the ability to query different DBMSs as if they were all a single DBMS with a
single API.

o Impacting performance, since the customer will be competing with the production data stores.

Such a warehouse will need highly specialized and sophisticated 'middleware', possibly with a
single interface to the client. A facility to browse the extracted records before report generation
may also be essential. An integrated metadata repository becomes absolutely essential in this
environment.
Distributed Data Warehouses
The concept of a distributed data warehouse covers two types of distributed data
warehouses and their variations: local enterprise warehouses, which are distributed
throughout the enterprise, and a global warehouse.

Characteristics of Local data warehouses


o Activity occurs at the local level
o The bulk of the operational processing is done locally
o The local site is autonomous
o Each local data warehouse has its own unique architecture and contents of data
o The data is unique and of prime importance to that locality only
o The majority of the records are local and not replicated
o Any intersection of data between local data warehouses is circumstantial
o Each local warehouse serves different technical communities
o The scope of the local data warehouse is confined to the local site
o Local warehouses also include historical data and are integrated only within the local site.

Virtual Data Warehouses


A virtual data warehouse is created in the following stages:

1. Installing a set of data access, data dictionary, and process management facilities.
2. Training the end clients.
3. Monitoring how the data warehouse facilities are used.
4. Based upon actual usage, creating a physical data warehouse to serve the high-frequency
results.

This strategy implies that end users are allowed to access the operational databases directly, using
whatever tools are available on the data access network. This approach provides ultimate
flexibility as well as the minimum amount of redundant information that must be loaded and
maintained. The data warehouse is a great idea, but it is difficult to build and requires
investment. Why not use a cheap and fast approach by eliminating the transformation phase and the
repositories for metadata and other databases? This approach is termed the 'virtual data
warehouse.'

To accomplish this, there is a need to define four kinds of data:

1. A data dictionary containing the definitions of the various databases.

2. A description of the relationships between the data components.
3. A description of the way the user will interface with the system.
4. The algorithms and business rules that describe what to do and how to do it.

Disadvantages
1. Since queries compete with production record transactions, performance can be degraded.
2. There is no metadata, no summary records, and no individual DSS (Decision Support
System) integration or history. All queries must be repeated, causing an additional burden
on the system.
3. There is no refreshing process, which causes the queries to be very complex.
Lecture Topic 3: Data Warehouse Modeling
Data warehouse modeling is the process of designing the schemas of the detailed and summarized
information of the data warehouse.

Data warehouse modeling is an essential stage of building a data warehouse for two main reasons.

1. Through the schema, data warehouse clients can visualize the relationships among
the warehouse data, to use them with greater ease.

2. A well-designed schema allows an effective data warehouse structure to emerge, to


help decrease the cost of implementing the warehouse and improve the efficiency of
using it.

The data within the warehouse itself has a particular architecture, with an emphasis on
various levels of summarization:

The current detail record is central in importance as it:

o It reflects the most current happenings, which are commonly the most interesting.
o It is voluminous, as it is saved at the lowest level of granularity.
o It is almost always saved on disk storage, which is fast to access but expensive and
difficult to manage.
o Older detail data is stored on some form of mass storage; it is infrequently accessed
and kept at a level of detail consistent with the current detailed data.
o Lightly summarized data is data extracted from the low level of detail found at the
current detailed level and is usually stored on disk storage. When building the data
warehouse, we have to remember over what unit of time the summarization is done and
also which components or attributes the summarized data will contain.
o Highly summarized data is compact and directly available and can even be
found outside the warehouse.

Metadata is the final element of the data warehouse and is of a different dimension, in that it
is not data drawn from the operational environment; rather, it is used as:

o A directory to help the DSS analyst locate the contents of the data warehouse.
o A guide to the mapping of data as it is transformed from the operational environment to the data
warehouse environment.
o A guide to the methods used for summarization between the current detailed data, the lightly
summarized data, and the highly summarized data.

Data Modeling Life Cycle


In this section, we define a data modeling life cycle. It is a straightforward process of transforming
the business requirements to fulfill the goals for storing, maintaining, and accessing the data
within IT systems. The result is a logical and physical data model for an enterprise data
warehouse.

The objective of the data modeling life cycle is primarily the creation of a storage area for business
information. That area comes from the logical and physical data modeling stages.
Conceptual Data Model
A conceptual data model identifies the highest-level relationships between the different entities.

Characteristics of the conceptual data model

o It contains the essential entities and the relationships among them.


o No attribute is specified.
o No primary key is specified.

We can see that the only information shown via the conceptual data model is the entities that define the
data and the relationships between those entities. No other detail is shown by the conceptual
data model.

Logical Data Model


A logical data model describes the information in as much detail as possible, without regard to
how it will be physically implemented in the database. The primary objective of logical data
modeling is to document the business data structures, processes, rules, and relationships in a
single view - the logical data model.

Features of a logical data model

o It involves all entities and relationships among them.


o All attributes for each entity are specified.
o The primary key for each entity is stated.
o Referential integrity is specified (FK relations).

The steps for designing the logical data model are as follows:

o Specify primary keys for all entities.


o List the relationships between different entities.
o List all attributes for each entity.
o Normalization.
o No data types are listed


Physical Data Model

A physical data model describes how the model will be implemented in the database. A physical database
model shows all table structures, column names, data types, constraints, primary keys,
foreign keys, and relationships between tables. The purpose of physical data modeling is the
mapping of the logical data model to the physical structures of the RDBMS system hosting the
data warehouse. This includes defining physical RDBMS structures, such as tables and data
types, to use when storing the information. It may also include the definition of new data structures for
enhancing query performance.

Characteristics of a physical data model

o Specification of all tables and columns.

o Foreign keys are used to identify relationships between tables.

The steps for physical data model design are as follows (a short sketch follows this list):

o Convert entities to tables.


o Convert relationships to foreign keys.
o Convert attributes to columns.
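As an illustration of these steps, the following sketch maps a small, hypothetical logical model (a Customer dimension and a Sales fact) to physical tables. The table and column names are invented for the example, and sqlite3 merely stands in for the warehouse RDBMS.

```python
# A minimal sketch of mapping a logical model to physical RDBMS structures:
# entities become tables, attributes become typed columns, relationships become foreign keys.
import sqlite3

DDL = """
CREATE TABLE customer_dim (                 -- entity -> table
    customer_id INTEGER PRIMARY KEY,        -- primary key from the logical model
    customer_name TEXT NOT NULL,            -- attributes -> columns with data types
    city TEXT
);
CREATE TABLE sales_fact (
    sale_id INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL,
    sale_date TEXT NOT NULL,
    amount REAL NOT NULL,
    FOREIGN KEY (customer_id) REFERENCES customer_dim (customer_id)  -- relationship -> foreign key
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(DDL)
print([r[0] for r in conn.execute("SELECT name FROM sqlite_master WHERE type='table'")])
```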

Data Warehouse Design


A data warehouse, integrated for online analytical processing (OLAP), needs to meet the
requirements of all the business stages within the entire organization. Furthermore, business analytical
functions change over time, which results in changes in the requirements for the systems. Therefore, data
warehouse and OLAP systems are dynamic, and the design process is continuous.
The target of the design becomes how the records from multiple data sources should be extracted,
transformed, and loaded (ETL) to be organized in a database as the data warehouse.

There are two approaches

1. "top-down"approach

2. "bottom-up"approach

Top-Down Design Approach
In the "top-down" design approach, a data warehouse is described as a subject-oriented, time-variant,
non-volatile and integrated data repository for the entire enterprise. Data from different sources are
validated, reformatted and saved in a normalized (up to 3NF) database as the data warehouse. The data
warehouse stores "atomic" information, the data at the lowest level of granularity, from which
dimensional data marts can be built by selecting the data required for specific business subjects or
particular departments. This is a data-driven approach, as the information is gathered and
integrated first, and then the business requirements by subject for building data marts are formulated. The
advantage of this method is that it supports a single integrated data source. Thus data marts built from it will
have consistency when they overlap.

Advantages of top-down design
Data marts are loaded from the data warehouse.
Developing a new data mart from the data warehouse is very easy.
Disadvantages of top-down design
This technique is inflexible to changing departmental needs.
The cost of implementing the project is high.
Bottom-Up Design Approach
In the "bottom-up" approach, a data warehouse is described as "a copy of transaction data specifically
architected for query and analysis," termed the star schema.

Data marts include the lowest-grain data and, if needed, aggregated data too. Instead of a normalized
database for the data warehouse, a denormalized dimensional database is adopted to meet the data
delivery requirements of data warehouses. Using this method, to use the set of data marts as the
enterprise data warehouse, data marts should be built with conformed dimensions in mind, meaning that
common objects are represented in the same way in different data marts. The conformed dimensions connect
the data marts to form a data warehouse, which is generally called a virtual data warehouse.

The advantage of the "bottom-up" design approach is that it has quick ROI, as developing a data
mart, a data warehouse for a single subject, takes far less time and effort than developing an
enterprise-wide data warehouse. Also, the risk of failure is even less. This method is inherently
incremental. This method allows the project team to learn and grow.

Advantages of bottom-up design


Documents can be generated quickly.
The data warehouse can be extended to accommodate new business units.
It is just a matter of developing new data marts and then integrating them with other data marts.
Disadvantages of bottom-up design
The locations of the data warehouse and the data marts are reversed in the bottom-up design
approach.

Differentiate between the Top-Down Design Approach and the Bottom-Up Design Approach
| Top-Down Design Approach | Bottom-Up Design Approach |
| --- | --- |
| Breaks the vast problem into smaller subproblems. | Solves the essential low-level problems and integrates them into a higher one. |
| Inherently architected; not a union of several data marts. | Inherently incremental; can schedule essential data marts first. |
| Single, central storage of information about the content. | Departmental information stored. |
| Centralized rules and control. | Departmental rules and control. |
| It includes redundant information. | Redundancy can be removed. |
| It may see quick results if implemented with iterations. | Less risk of failure, favorable return on investment, and proof of techniques. |

ETL Process
ETL stands for Extraction, Transformation, and Loading. ETL is a process which is used to
extract data, transform the data, and load the data into the final target. ETL follows a
process of loading the data from the source system into the data warehouse.

Steps to perform the ETL process are:

Extraction
Extraction is the first step, where data from different sources like text files, XML files, Excel
files, or various other sources is collected.
Transformation
Transformation is the second step of the ETL process, where all the collected data is
transformed into the same format. The format can be anything as per our requirements. In this
step, a set of rules or functions is applied to the extracted data to convert it into a single standard
format.

It may involve the following tasks (a short sketch follows this list):

o Filtering: Only specific attributes are loaded into the data warehouse.
o Cleaning: Filling up the null values with specific default values.
o Joining: Joining multiple attributes into one.
o Splitting: Splitting a single attribute into multiple attributes.
o Sorting: Sorting the tuples based on the attributes.
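A minimal pandas sketch of these transformation tasks follows; the column names and sample values are hypothetical examples, not part of any particular tool.

```python
# Filtering, cleaning, joining, splitting, and sorting on small in-memory tables.
import pandas as pd

orders = pd.DataFrame({"cust_id": [1, 2, 3], "amount": [120.0, None, 80.0],
                       "full_name": ["Asha Rao", "Ravi Kumar", "Meena Iyer"]})
regions = pd.DataFrame({"cust_id": [1, 2, 3], "region": ["South", "North", "South"]})

df = orders[["cust_id", "amount", "full_name"]]          # filtering: keep specific attributes
df = df.fillna({"amount": 0.0})                          # cleaning: default value for nulls
df = df.merge(regions, on="cust_id")                     # joining with another source
df[["first_name", "last_name"]] = df["full_name"].str.split(" ", n=1, expand=True)  # splitting
df["label"] = df["first_name"] + " (" + df["region"] + ")"   # joining attributes into one
df = df.sort_values("amount", ascending=False)           # sorting tuples on an attribute

print(df[["cust_id", "first_name", "last_name", "region", "amount", "label"]])
```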
Loading
Loading is the final step of the ETL process. The big chunks of data collected from various
sources are transformed and finally loaded into the data warehouse.

ETL is a process to extract the data from different source systems, transform the data, and load the
data into the data warehouse. The ETL process requires active input from various stakeholders,
including developers, analysts, testers, and top executives.

ETL (Extract, Transform and Load) is an automated process of extracting the information required
for analysis from raw data, transforming it into a format that can serve business needs,
and loading it into a data warehouse. ETL typically summarizes data to reduce its size and improve
performance for a specific type of analysis.

The ETL process uses the pipelining concept: as soon as some data is extracted, it can be
transformed, and while that data is being transformed, new data can be extracted. Likewise, while the
transformed data is being loaded into the data warehouse, the already extracted data can be
transformed. The sketch below illustrates the idea.
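The following is a minimal sketch of that pipelining idea using Python generators; the sample rows and the transformation rule are hypothetical.

```python
# Rows flow through extract -> transform -> load lazily, so later rows are still
# being extracted while earlier rows are already being transformed and loaded.
def extract():
    for row in [{"city": "delhi", "sales": "100"},
                {"city": "kolkata", "sales": "250"}]:
        yield row                                   # extraction emits one row at a time

def transform(rows):
    for row in rows:
        yield {"city": row["city"].title(),         # transformation starts as soon as
               "sales": int(row["sales"])}          # the first row has been extracted

def load(rows):
    warehouse = []
    for row in rows:
        warehouse.append(row)                       # loading overlaps with the steps above
    return warehouse

print(load(transform(extract())))
```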

When we build an ETL infrastructure, we must integrate data sources, and carefully plan and test to
ensure that we transform the source data correctly.
Lecture Topic 4: OLAP (Online Analytical Processing)
OLAP stands for On-Line Analytical Processing. OLAP is a category of software
technology which enables analysts, managers, and executives to gain insight into information
through fast, consistent, interactive access to a wide variety of possible views of data that has
been transformed from raw information to reflect the real dimensionality of the enterprise as
understood by the clients.

OLAP implements the multidimensional analysis of business information and supports the
capability for complex calculations, trend analysis, and sophisticated data modeling. It is rapidly
becoming the essential foundation for intelligent solutions, including business performance
management, planning, budgeting, forecasting, financial reporting, analysis, simulation
models, knowledge discovery, and data warehouse reporting. OLAP enables end clients to
perform ad hoc analysis of records in multiple dimensions, providing the insight and
understanding they require for better decision making.

OLAP applications are used by a variety of the functions of an organization.

Finance and accounting:
o Budgeting
o Activity-based costing
o Financial performance analysis
o Financial modeling

Sales and marketing:
o Sales analysis and forecasting
o Market research analysis
o Promotion analysis
o Customer analysis
o Market and customer segmentation

Production:
o Production planning
o Defect analysis

OLAP cubes have two main purposes. The first is to provide business users with a data model more
intuitive to them than a tabular model. This model is called a Dimensional Model.
Characteristics of OLAP

In the FASMI characteristics of OLAP methods, the term is derived from the first letters of the
characteristics:

o Fast: The system is targeted to deliver most responses to the client within about five
seconds, with the most elementary analysis taking no more than one second and very few taking more
than 20 seconds.
o Analysis: The system can cope with any business logic and statistical analysis that is
relevant for the application and the user, while keeping it easy enough for the target client. Although some
pre-programming may be needed, it is not acceptable if all application definitions have to
be programmed; the user must be able to define new ad hoc calculations as part of the analysis and to
report on the data in any desired way, without having to program. This excludes products (like Oracle
Discoverer) that do not allow the user to define new ad hoc calculations as part of the analysis or that
do not allow adequate end-user-oriented calculation flexibility.
o Share: The system implements all the security requirements for confidentiality and, if multiple
write connections are needed, concurrent update locking at an appropriate level. Not all applications
need users to write data back, but for the increasing number that do, the system should be
able to handle multiple updates in a timely, secure manner.
o Multidimensional: This is the basic requirement. The OLAP system must provide a multidimensional conceptual view of
the data, including full support for hierarchies, as this is certainly the most logical way to
analyze businesses and organizations.
o Information: The system should be able to hold all the data needed by the applications. Data sparsity should be
handled in an efficient manner.

The main characteristics of OLAP are as follows:

o Multidimensional conceptual view: OLAP systems let business users have a dimensional and
logical view of the data in the data warehouse. It helps in carrying slice and dice operations.
o Multi-User Support: Since OLAP systems are shared, the OLAP operation should
provide normal database operations, including retrieval, update, concurrency control, integrity, and security.
o Accessibility: OLAP acts as a mediator between data warehouses and front-end. The OLAP
operations should be sitting between data sources (e.g., data warehouses) and an OLAP front-end.
o Storing OLAP results: OLAP results are kept separate from data sources.
o Uniform reporting performance: Increasing the number of dimensions or the database size
should not significantly degrade the reporting performance of the OLAP system.
o OLAP provides for distinguishing between zero values and missing values so that aggregates are
computed correctly.
o OLAP system should ignore all missing values and compute correct aggregate values.
o OLAP facilitates interactive querying and complex analysis for the users.
o OLAP allows users to drill down for greater detail or roll up for aggregations of metrics along a
single business dimension or across multiple dimensions.
o OLAP provides the ability to perform intricate calculations and comparisons.
o OLAP presents results in a number of meaningful ways, including charts and graphs.

Benefits of OLAP
OLAP holds several benefits for businesses:-

1. OLAP helps managers in decision-making through the multidimensional views of data that it is
efficient in providing, thus increasing their productivity.
2. OLAP functions are self-sufficient owing to the inherent flexibility they provide to the organized
databases.
3. It facilitates simulation of business models and problems, through extensive management of
analysis-capabilities.
4. In conjunction with data warehouse, OLAP can be used to support a reduction in the application
backlog, faster data retrieval, and reduction in query drag.

Difference between OLTP and OLAP


Following are the difference between OLAP and OLTP system.

1) Users: OLTP systems are designed for office workers, while OLAP systems are designed for
decision-makers. Therefore, while an OLTP system may be accessed by hundreds or even thousands of
clients in a large enterprise, an OLAP system is likely to be accessed only by a select class of managers
and may be used only by dozens of users.
2) Functions: OLTP systems are mission-critical. They support the day-to-day operations of an enterprise
and are largely performance- and availability-driven. These operations carry out simple repetitive
tasks. OLAP systems are management-critical; they support the decision-making tasks of an enterprise
through detailed investigation.

3) Nature: Although SQL queries return a set of data, OLTP systems are designed to process one record at
a time, for example a record related to a user who may be on the phone or in the store. OLAP systems are
not designed to deal with individual customer records. Instead, they handle queries that deal with many
records at a time and provide summary or aggregate information to a manager. OLAP applications involve
data stored in a data warehouse that has been extracted from many tables and possibly from more than
one enterprise database.

4) Design: OLTP database operations are designed to be application-oriented, while OLAP operations are
designed to be subject-oriented. OLTP systems view the enterprise record as a collection of tables
(possibly based on an entity-relationship model), whereas OLAP operations view enterprise information as
multidimensional.

5) Data: OLTP systems usually deal only with the current status of data. For example, a record about an
employee who left three years ago may not be available in the Human Resources system. The old data
may have been archived on some type of stable storage media and may not be accessible online. On the
other hand, OLAP systems need historical data over several years, since trends are often essential in
decision making.

6) Kind of use: OLTP methods are used for reading and writing operations while OLAP methods usually
do not update the data.

7) View: An OLTP system focuses primarily on the current data within an enterprise or department,
without referring to historical data or data in other organizations. In contrast, an OLAP system spans
multiple versions of a database schema, due to the evolutionary process of an organization. OLAP systems
also deal with information that originates from different organizations, integrating information from
many data stores. Because of their huge volume, these data are stored on multiple storage media.
8) Access Patterns: The access pattern of an OLTP system consists primarily of short, atomic
transactions. Such a system requires concurrency control and recovery techniques. However, access to
OLAP systems consists mostly of read-only operations, because these data warehouses store historical information.

OLAP Operations in the Multidimensional Data Model

OLAP stands for Online Analytical Processing. It is a software technology that
allows users to analyze information from multiple database systems at the same time. It
is based on the multidimensional data model and allows the user to query multidimensional
data (e.g. Delhi -> 2018 -> Sales data). OLAP databases are divided into
one or more cubes, and these cubes are known as hyper-cubes.

OLAP operations:

There are five basic analytical operations that can be performed on an OLAP cube (a short sketch follows this list):

1. Drill down: In the drill-down operation, less detailed data is converted into more
detailed data. It can be done by:
 Moving down in the concept hierarchy
 Adding a new dimension
For example, the drill-down operation can be performed by moving down in the concept hierarchy
of the Time dimension (Quarter -> Month).
2. Roll up: It is just the opposite of the drill-down operation. It performs aggregation on
the OLAP cube. It can be done by:
 Climbing up in the concept hierarchy
 Reducing the dimensions
For example, the roll-up operation can be performed by climbing up in the concept hierarchy of the
Location dimension (City -> Country).

3. Dice: It selects a sub-cube from the OLAP cube by selecting two or more
dimensions. For example, a sub-cube can be selected with the following criteria:
 Location = "Delhi" or "Kolkata"
 Time = "Q1" or "Q2"
 Item = "Car" or "Bus"

4. Slice: It selects a single dimension from the OLAP cube, which results in the creation of a new
sub-cube. For example, a slice can be performed on the dimension Time = "Q1".
5. Pivot: It is also known as the rotation operation, as it rotates the current view to get a
new view of the representation. For example, in the sub-cube obtained after a slice operation,
performing a pivot operation gives a new view of it.
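The following pandas sketch illustrates these operations on a tiny, hypothetical sales cube; the dimension members and figures are invented for the example.

```python
# Roll-up, drill-down, slice, dice, and pivot expressed over a small fact table.
import pandas as pd

cube = pd.DataFrame({
    "city":    ["Delhi", "Delhi", "Kolkata", "Kolkata"],
    "country": ["India"] * 4,
    "quarter": ["Q1", "Q1", "Q1", "Q2"],
    "month":   ["Jan", "Feb", "Mar", "Apr"],
    "item":    ["Car", "Bus", "Car", "Bus"],
    "sales":   [120, 80, 100, 60],
})

roll_up    = cube.groupby(["country", "quarter"])["sales"].sum()   # roll up: City -> Country
drill_down = cube.groupby(["city", "month"])["sales"].sum()        # drill down: Quarter -> Month
slice_q1   = cube[cube["quarter"] == "Q1"]                         # slice: Time = "Q1"
dice       = cube[cube["city"].isin(["Delhi", "Kolkata"]) &        # dice: select on two or
                  cube["item"].isin(["Car", "Bus"])]               # more dimensions
pivot      = slice_q1.pivot_table(index="item", columns="city",    # pivot: rotate the view
                                  values="sales", aggfunc="sum")

print(roll_up, drill_down, dice, pivot, sep="\n\n")
```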
Lecture Topic 5: Types of OLAP
There are three main types of OLAP servers, as follows:

ROLAP stands for Relational OLAP, an application based on relational DBMSs.

MOLAP stands for Multidimensional OLAP, an application based on multidimensional DBMSs.

HOLAP stands for Hybrid OLAP, an application using both relational and multidimensional
techniques.

Relational OLAP (ROLAP) Server
These are intermediate servers which stand between a relational back-end server and user front-end
tools.

They use a relational or extended-relational DBMS to save and handle warehouse data, and
OLAP middleware to provide the missing pieces.

ROLAP servers contain optimization for each DBMS back end, implementation of aggregation
navigation logic, and additional tools and services.

ROLAP technology tends to have higher scalability than MOLAP technology.


ROLAP systems work primarily from the data that resides in a relational database, where the
base data and dimension tables are stored as relational tables. This model permits the
multidimensional analysis of data.

This technique relies on manipulating the data stored in the relational database to give the
appearance of traditional OLAP's slicing and dicing functionality. In essence, each action of
slicing and dicing is equivalent to adding a "WHERE" clause to the SQL statement, as in the sketch below.
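A minimal sketch of that idea: slicing and dicing a relational fact table amounts to adding WHERE predicates to an aggregate SQL query. The table, columns, and values are hypothetical, and sqlite3 simply stands in for the relational back end.

```python
# Slice and dice expressed as WHERE clauses generated against a relational fact table.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_fact (city TEXT, quarter TEXT, item TEXT, amount REAL)")
conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?, ?)",
                 [("Delhi", "Q1", "Car", 120), ("Delhi", "Q2", "Bus", 80),
                  ("Kolkata", "Q1", "Car", 100), ("Kolkata", "Q2", "Bus", 60)])

# Base aggregate view that a ROLAP layer might generate.
base_sql = "SELECT city, quarter, SUM(amount) FROM sales_fact"

# Slice on Time = 'Q1': one extra WHERE predicate.
slice_sql = base_sql + " WHERE quarter = 'Q1' GROUP BY city, quarter"

# Dice on two dimensions: additional predicates in the same WHERE clause.
dice_sql = (base_sql +
            " WHERE city IN ('Delhi', 'Kolkata') AND item IN ('Car', 'Bus')"
            " GROUP BY city, quarter")

print(conn.execute(slice_sql).fetchall())
print(conn.execute(dice_sql).fetchall())
```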

Relational OLAP Architecture


ROLAP Architecture includes the following components

o Database server.
o ROLAP server.
o Front-end tool.

Relational OLAP (ROLAP) is the latest and fastest-growing OLAP technology segment in the
market. This method allows multiple multidimensional views of two-dimensional relational
tables to be created, avoiding the need to structure the data around the desired view.

Some products in this segment have supported strong SQL engines to handle the complexity of
multidimensional analysis. This includes creating multiple SQL statements to handle user
requests, being 'RDBMS'-aware, and being capable of generating SQL statements tuned to
the optimizer of the DBMS engine.
Advantages
Can handle large amounts of information: The data size limitation of ROLAP technology
depends on the data size of the underlying RDBMS. So, ROLAP itself does not restrict the
amount of data.

Can leverage the features of the relational database: The RDBMS already comes with a lot of features,
so ROLAP technologies (which work on top of the RDBMS) can leverage these functionalities.

Disadvantages
Performance can be slow: Each ROLAP report is essentially an SQL query (or multiple SQL queries) against the
relational database, and the query time can be prolonged if the underlying data size is large.

Limited by SQL functionalities: ROLAP technology relies upon developing SQL
statements to query the relational database, and SQL statements do not suit all needs.

Lecture Topic 6: Multidimensional OLAP (MOLAP) Server


A MOLAP system is based on a native logical model that directly supports multidimensional
data and operations. Data are stored physically into multidimensional arrays, and positional
techniques are used to access them.
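The following is a minimal NumPy sketch of this storage idea: measures live in a multidimensional array, and individual cells are reached positionally through dimension indexes. The dimension members and values are hypothetical.

```python
# A tiny in-memory "cube": one cell per (city, quarter, item) combination,
# accessed by position rather than by scanning relational tables.
import numpy as np

cities   = ["Delhi", "Kolkata", "Mumbai"]
quarters = ["Q1", "Q2", "Q3", "Q4"]
items    = ["Car", "Bus"]

cube = np.zeros((len(cities), len(quarters), len(items)))
cube[cities.index("Delhi"), quarters.index("Q1"), items.index("Car")] = 120.0
cube[cities.index("Kolkata"), quarters.index("Q2"), items.index("Bus")] = 60.0

# Positional access: a cell lookup is an array index, not a table scan.
print(cube[cities.index("Delhi"), quarters.index("Q1"), items.index("Car")])

# Pre-aggregations can be stored the same way, e.g. total sales per city.
sales_per_city = cube.sum(axis=(1, 2))
print(dict(zip(cities, sales_per_city)))
```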

One of the significant distinctions of MOLAP from ROLAP is that the data are summarized
and stored in an optimized format in a multidimensional cube, instead of in a relational
database. In the MOLAP model, data are structured into proprietary formats according to the clients'
reporting requirements, with the calculations pre-generated on the cubes.

MOLAP Architecture
MOLAP Architecture includes the following components

o Database server.
o MOLAP server.
o Front-end tool.
The MOLAP structure primarily reads precompiled data. A MOLAP structure has limited capabilities
to dynamically create aggregations or to evaluate results which have not been pre-calculated and
stored.

Applications requiring iterative and comprehensive time-series analysis of trends are well suited for
MOLAP technology (e.g., financial analysis and budgeting).

Advantages
Excellent Performance: A MOLAP cube is built for fast information retrieval, and is optimal for
slicing and dicing operations.

Can perform complex calculations: All calculations have been pre-generated when the cube is
created. Hence, complex calculations are not only possible, but they return quickly.

Disadvantages
Limited in the amount of information it can handle: Because all calculations are performed when
the cube is built, it is not possible to contain a large amount of data in the cube itself.

Requires additional investment: Cube technology is generally proprietary and does not already
exist in the organization. Therefore, to adopt MOLAP technology, additional investments
in human and capital resources are likely to be needed.

Hybrid OLAP (HOLAP) Server
HOLAP incorporates the best features of MOLAP and ROLAP into a single architecture. HOLAP
systems save larger quantities of detailed data in the relational tables, while
the aggregations are stored in pre-calculated cubes. HOLAP can also drill through from the
cube down to the relational tables for detailed data. Microsoft SQL Server 2000 provides
a hybrid OLAP server.

Advantages of HOLAP
1. HOLAP provides the benefits of both MOLAP and ROLAP.
2. It provides fast access at all levels of aggregation.
3. HOLAP balances the disk space requirement, as it only stores the aggregate information on the
OLAP server while the detailed records remain in the relational database, so no duplicate copy of the
detailed records is maintained.

Disadvantages of HOLAP
1. HOLAP architecture is very complicated because it supports both MOLAP and ROLAP servers.
Difference between ROLAP, MOLAP, and HOLAP
ROLAP
o ROLAP stands for Relational Online Analytical Processing.
o The ROLAP storage mode causes the aggregations of the partition to be stored in indexed views in the relational database that was specified in the partition's data source.
o ROLAP does not cause a copy of the source information to be stored in the Analysis Services data folders. Instead, when the outcome cannot be derived from the query cache, the indexed views in the data source are accessed to answer queries.
o Query response is frequently slower with ROLAP storage than with the MOLAP or HOLAP storage modes. Processing time is also frequently slower with ROLAP.

MOLAP
o MOLAP stands for Multidimensional Online Analytical Processing.
o The MOLAP storage mode causes the aggregations of the partition and a copy of its source information to be saved in a multidimensional structure in Analysis Services when the partition is processed.
o The MOLAP structure is highly optimized to maximize query performance. The storage area can be on the computer where the partition is defined or on another computer running Analysis Services. Because a copy of the source information resides in the multidimensional structure, queries can be resolved without accessing the partition's source records.
o Query response times can be reduced substantially by using aggregations. The records in the partition's MOLAP structure are only as current as the most recent processing of the partition.

HOLAP
o HOLAP stands for Hybrid Online Analytical Processing.
o The HOLAP storage mode combines attributes of both MOLAP and ROLAP. Like MOLAP, HOLAP causes the aggregations of the partition to be stored in a multidimensional structure in an SQL Server Analysis Services instance.
o HOLAP does not cause a copy of the source information to be stored. For queries that access only summary records in the aggregations of a partition, HOLAP is the equivalent of MOLAP.
o Queries that access source records (for example, drilling down to an atomic cube cell for which there is no aggregation information) must retrieve data from the relational database and will not be as fast as they would be if the source information were stored in the MOLAP architecture.
