Introduction to ETL
SESSION- 10
ETL – Extract, Transform, Load
ETL is the process of extracting data from various sources, transforming that data to meet your requirements, and then loading it into a target data warehouse.
What is ETL?
OLTP systems cannot be used for analytical purposes; a separate system, called OLAP, is needed.
There is a source (the OLTP system) and there is a target (the OLAP system); how do we transfer the data from one system to the other?
The set of methods used to do this is the ETL process.
Now that there are 2 systems, why not directly copy the data from one system to the other? That would be an easy process.
But the answer is that the users will not be interested in all of the data recorded by the transactional system. OLAP is a system utilized for strategy and decision making.
What is ETL?
A DWH is subject oriented.
An OLTP system will contain data about the employees of an organization as well as data about its customers.
But the organization would need two different OLAP systems based on these subjects, i.e. employees and customers; hence the data has to be identified from the source system and only the relevant data should be loaded.
[Diagram: the organization's source system feeds two subject-oriented OLAP systems - one OLAP system for the EMPLOYEE subject and one for the CUSTOMER subject.]
What is ETL?
One more issue is that some of the data in the OLTP system might not be in the proper format.
For example, a phone number may have only 3 digits instead of 10, or an email address may be malformed or blank; such data needs to be filtered out (a small filtering sketch follows the diagram below).
Each OLTP system might also use a different format. For example, banks might be based on mainframes, but employee or CRM data can be in SAP ERP systems.
So, prior to loading, the homogeneity of the data should be maintained.
[Diagram: multiple SOURCE systems feed the ETL process, which loads the DATA WAREHOUSE.]
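A minimal Python sketch of such a cleansing filter, assuming hypothetical phone and email fields and very simple format rules, might look like this:

    import re

    def is_valid(record):
        """Simple cleansing rules: a 10-digit phone number and a non-blank, well-formed email."""
        phone_ok = re.fullmatch(r"\d{10}", record.get("phone", "")) is not None
        email_ok = re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", record.get("email", "")) is not None
        return phone_ok and email_ok

    source_rows = [
        {"phone": "9876543210", "email": "jane@example.com"},  # clean, kept
        {"phone": "123",        "email": "jane@example.com"},  # 3-digit phone, filtered out
        {"phone": "9876543210", "email": ""},                  # blank email, filtered out
    ]
    clean_rows = [row for row in source_rows if is_valid(row)]
    print(clean_rows)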
ETL Challenges
The complexity of the ETL design depends on the complexity of the warehouse, the number of OLTP systems that data has to be extracted from and their complexity, and the quality of the data present in the OLTP systems.
Most of the DWH loading takes place at night, so that the end users can work on report generation during the day.
A proper time slot has to be decided, and if there are multiple applications running on the server and the server capacity is small, the loads have to be synchronized so that the load on the server is not too high.
The next issue is the incremental load. Today's data is already loaded into the DWH, so there is no point in loading the same data tomorrow; only the changes have to be loaded. This is also a challenging task.
Data duplication is another concern. There is no point in loading the same data twice, as it will take up more space in the DWH without serving any purpose.
Then we come to the load process itself: millions of rows are loaded into the DWH, and the load process can take hours to complete.
Summary of ETL
Determine the data you need and the data you don’t need in your target
Determine the source(s) from which the above data is going to come
Determine the data extraction methods, cleansing rules and transformation rules
Load the dimension data and the fact data
Data Extraction
There are multiple disparate sources on the OLTP side.
Most of the systems, especially in the financial services domain, are legacy systems such as mainframes that were designed a long time ago; to keep up with the times, the company may also have newer systems like CRM or SAP.
Connecting directly to mainframe systems is a challenging task.
Hence, in a production environment, data is usually extracted within the mainframe system and sent as files.
These files are then cleansed and loaded.
Data Extraction
The type of data extraction depends on how the data gets stored in the OLTP system.
For example, consider the address (only the state is considered for simplicity) of a person, which changes over time:
Person ID 12345
From Jan 1st 2000 to May 1st 2000 – Lived in Canada
From May 2nd 2000 to June 5th 2001 – Lived in New York
From June 5th 2001 till now – Living in London
Person ID 12346
From May 4th 2002 to Sept 4th 2002 – California
From Sept 4th 2002 till now – San Francisco
Data Extraction – Approach 1
Approach 1 is to store only the most current information for each person ID.
As can be seen in the snapshots below, at any given point in time there is just one record per person ID.
At the end of every single day, the data is extracted and loaded into the warehouse.
Snapshot 1:
PERSON ID   DATE          STATE
12345       JAN 1 2000    CANADA
12346       MAY 4 2002    CALIFORNIA

Snapshot 2:
PERSON ID   DATE          STATE
12345       MAY 2 2000    NEW YORK
12346       SEPT 4 2002   SAN FRANCISCO

Snapshot 3:
PERSON ID   DATE          STATE
12345       JUNE 5 2001   LONDON
12346       SEPT 4 2002   SAN FRANCISCO
Data Extraction – Approach 2
Approach 2 would be to store the entire history.
For this approach, all changes since the time of the last extract are extracted. This is called Change Data Capture.
There is usually a timestamp associated with each change that tells us when the change was performed.
For this to work, logic has to be added at the sources, which places an additional burden on the source systems.
Hence, usually the entire data set is extracted and sent, and during the loading process it is compared with the already existing data to check whether it is the same (a comparison sketch follows the table below).
If it is the same, it is discarded; if there is a change, it is entered into the OLAP system.
PERSON ID   DATE          STATE
12345       Jan 1 2000    Canada
12345       May 2 2000    New York
12345       June 5 2001   London
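A minimal Python sketch of this compare-and-discard step, using in-memory dictionaries as stand-ins for the existing OLAP data and the incoming full extract, might look like this:

    # Latest state already held in the OLAP system, keyed by person ID (hypothetical values).
    existing = {
        12345: ("June 5 2001", "London"),
        12346: ("Sept 4 2002", "San Francisco"),
    }

    # Full extract received from the source system.
    incoming = [
        (12345, "June 5 2001", "London"),       # unchanged -> discarded
        (12346, "Jan 10 2003", "Los Angeles"),  # changed   -> new history row
    ]

    new_history_rows = []
    for person_id, date, state in incoming:
        if existing.get(person_id) == (date, state):
            continue  # same as the data already loaded, so discard it
        new_history_rows.append((person_id, date, state))
        existing[person_id] = (date, state)

    print(new_history_rows)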
Data Extraction
Usually, jobs are scheduled during off-peak hours to capture the data from the OLTP systems and prepare a set of files.
Further jobs are scheduled to deliver the files to a particular directory on a particular server (on which the ETL tool is installed).
The ETL process can be made file dependent or time dependent, i.e. it can kick off as soon as the complete set of files arrives in the directory, or it can kick off at a predetermined time (a sketch of the file-dependent check follows below).
But if it is scheduled to kick off at a predetermined time, say 3 AM, and the source system is down at that time and has not sent the files, then the ETL process will run empty and may even fail.
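A minimal Python sketch of a file-dependent kick-off check, with a hypothetical landing directory and file set, might look like this:

    import os

    LANDING_DIR = "/data/landing"                                      # hypothetical directory
    EXPECTED_FILES = {"employees.csv", "customers.csv", "sales.csv"}   # hypothetical file set

    def ready_to_run():
        """File-dependent trigger: start the ETL only when the complete set of files has arrived."""
        arrived = set(os.listdir(LANDING_DIR)) if os.path.isdir(LANDING_DIR) else set()
        return EXPECTED_FILES.issubset(arrived)

    if ready_to_run():
        print("All source files present - kicking off ETL")
    else:
        # A purely time-dependent schedule (say 3 AM) would run empty or fail here,
        # so it is safer to wait or raise an alert instead.
        print("Source files missing - ETL not started")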
Data Transformation
Data type conversion is also done in the transformation stage.
There will be mismatches between the data types of the sources and those of the DWH.
For example, the Name field might have a length of 100 characters, or multiple columns may have been joined to form one column of length 50. A CHAR column might get converted into a VARCHAR (a small conversion sketch follows the diagram below).
[Diagram: a source date in DD/MM/YY format is converted to the DD/MM/YYYY format used in the warehouse.]
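A minimal Python sketch of two such conversions, assuming a hypothetical 50-character target column and the DD/MM/YY to DD/MM/YYYY change shown above, might look like this:

    from datetime import datetime

    def to_varchar(value, max_len=50):
        """Cast any source value to text and trim it to the target column width."""
        return str(value).strip()[:max_len]

    def widen_date(value):
        """Convert a source DD/MM/YY date into the warehouse DD/MM/YYYY format."""
        return datetime.strptime(value, "%d/%m/%y").strftime("%d/%m/%Y")

    print(to_varchar("  Jane Elizabeth Doe  "))   # -> 'Jane Elizabeth Doe'
    print(widen_date("05/06/01"))                 # -> '05/06/2001'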
Data Transformation
Joining is one more example of a transformation.
Multiple sources may be used to load one particular table of the warehouse.
One particular column may be split into multiple columns, or multiple columns may be joined to form one column (see the sketch below).
Example: the first name and last name may be joined to form one column called NAME, or the NAME column may be split into first name and last name. An address may contain country, state, city and zip code together, but at the warehouse a business user might want to generate a report based only on the zip code value. Hence the address field needs to be read from the source, split into individual fields and populated.
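A minimal Python sketch of joining and splitting columns, assuming a hypothetical comma-separated address field, might look like this:

    def join_name(first_name, last_name):
        """Join two source columns into a single NAME column."""
        return f"{first_name} {last_name}"

    def split_address(address):
        """Split a combined 'country, state, city, zip' field into individual columns."""
        country, state, city, zip_code = [part.strip() for part in address.split(",")]
        return {"country": country, "state": state, "city": city, "zip": zip_code}

    print(join_name("Jane", "Doe"))                               # -> 'Jane Doe'
    print(split_address("USA, California, San Francisco, 94105"))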
Data Transformation
Calculations or aggregations are performed on the source data.
For example, the sales data of a supermarket chain may be calculated at a monthly level from the individual transactions in the source.
The number of policies an agent has sold per state per month can be calculated. Profit percentage on a quarterly basis may need to be calculated and stored in the fact tables. All such calculations may be performed in the transformation stage (a small aggregation sketch follows the diagram below).
[Diagram: raw sales data is rolled up into agent-wise sales data to show whether profit is increasing or not.]
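A minimal Python sketch of such an aggregation, using hypothetical transaction-level rows of (month, agent, amount), might look like this:

    from collections import defaultdict

    # Hypothetical transaction-level sales rows: (month, agent, amount)
    transactions = [
        ("2000-01", "A1", 120.0),
        ("2000-01", "A1", 80.0),
        ("2000-01", "A2", 50.0),
        ("2000-02", "A1", 200.0),
    ]

    monthly_sales = defaultdict(float)
    for month, agent, amount in transactions:
        monthly_sales[(month, agent)] += amount  # roll individual transactions up to agent per month

    for (month, agent), total in sorted(monthly_sales.items()):
        print(month, agent, total)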
Data Transformation
Business rules are applied to the data.
For example, there might be a rule saying that one department should have only one manager. We need checks to make sure that if two managers arrive for one particular department, we populate the right one as per the rule.
Parent-child relationships also need to be maintained. For example, a record cannot be loaded into table B as long as it is not loaded into table A, and an update record cannot go in if the corresponding insert record is not present in the table (see the sketch below).
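A minimal Python sketch of both checks, using hypothetical department and order data, might look like this:

    # Rule: one department has exactly one manager.  Incoming rows: (department, manager).
    incoming = [("Sales", "Alice"), ("Sales", "Bob"), ("Finance", "Carol")]
    managers, rejected = {}, []
    for dept, manager in incoming:
        if dept in managers and managers[dept] != manager:
            rejected.append((dept, manager))  # second manager for the same department, rejected
        else:
            managers[dept] = manager

    # Parent-child rule: a child row may load only if its parent key is already in table A.
    table_a_keys = {"ORD-1", "ORD-2"}
    table_b_rows = [("ORD-1", "line 1"), ("ORD-9", "line 1")]
    loadable = [row for row in table_b_rows if row[0] in table_a_keys]

    print(managers, rejected, loadable)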
Data Transformation
Decoding of certain values from the source is performed.
For example, a company may use the codes 1, 2, 3, 4 in its source system to classify its employees as Active, Inactive, Retired and Resigned. What do these codes mean?
A business user who is generating reports may not know the meaning of these codes. The codes need to be read from the source, re-coded and populated such that they are easy for the business user to understand (see the sketch below).
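A minimal Python sketch of the decoding step, assuming the 1-4 status codes mentioned above, might look like this:

    STATUS_DECODE = {1: "Active", 2: "Inactive", 3: "Retired", 4: "Resigned"}

    source_rows = [("E100", 1), ("E101", 3), ("E102", 4)]   # hypothetical (employee_id, status_code) rows
    decoded = [(emp_id, STATUS_DECODE.get(code, "Unknown")) for emp_id, code in source_rows]
    print(decoded)   # -> [('E100', 'Active'), ('E101', 'Retired'), ('E102', 'Resigned')]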
Key values might need to be generated for the dimension tables.
Every row in a dimension table needs to be uniquely identified, and this key is used in the fact table to identify the dimension to which it is linked.
Hence keys may need to be generated in sequence, and this can be done in the transformation stage.
First the dimension table is read, the maximum key value for that dimension is obtained, and that maximum value is incremented by 1 for every new row (see the sketch below).
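A minimal Python sketch of sequential key generation, using an in-memory dictionary as a stand-in for the dimension table, might look like this:

    # Existing dimension rows: surrogate key -> natural key (hypothetical stand-in for the table)
    dimension = {1: "CUST-A", 2: "CUST-B"}

    def assign_keys(new_natural_keys, dimension):
        """Read the current maximum surrogate key and hand out max+1, max+2, ... to new rows."""
        next_key = max(dimension, default=0) + 1
        assigned = {}
        for natural_key in new_natural_keys:
            assigned[next_key] = natural_key
            next_key += 1
        return assigned

    print(assign_keys(["CUST-C", "CUST-D"], dimension))   # -> {3: 'CUST-C', 4: 'CUST-D'}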
Data Transformation
Unit conversion may need to be performed.
If the company operates in different geographies, the sources in those geographies have their own metric systems, but reporting might be done at the headquarters, which may use another set of metrics.
Hence the metrics may need to be converted to the ones used at the headquarters.
Another commonly converted parameter is the date. Different countries use different date notations; for example, the US format is MM/DD/YYYY while the UK format is DD/MM/YYYY. These may need to be standardized to a common format (see the sketch below).
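A minimal Python sketch of both conversions, assuming miles-to-kilometres as the metric change and the US and UK date formats mentioned above, might look like this:

    from datetime import datetime

    def miles_to_km(miles):
        """Convert a source distance in miles to the kilometres used at headquarters."""
        return miles * 1.609344

    def standardize_date(value, source_format):
        """Convert a US (MM/DD/YYYY) or UK (DD/MM/YYYY) date to a common ISO format."""
        return datetime.strptime(value, source_format).strftime("%Y-%m-%d")

    print(miles_to_km(100))                             # -> 160.9344
    print(standardize_date("07/04/2001", "%m/%d/%Y"))   # US source -> '2001-07-04'
    print(standardize_date("04/07/2001", "%d/%m/%Y"))   # UK source -> '2001-07-04'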
Data Loading
There are two ways of reading data from the source systems: read everything on every run, or read only the incremental changes.
As is obvious, the second method is more efficient, so let us now look at how an incremental load can be done in a warehouse. Consider the example:

CUSTOMER ID   DATE            PRODUCT
1             1st Jan 2000    A4 Sheets
1             1st Jan 2000    Pen
2             1st Jan 2000    Pencils
3             2nd Jan 2000    Stapler
1             2nd Jan 2000    Notepad
4             3rd Jan 2000    Markers

We have a table that details what product was sold to which customer of a stationery shop.
On 1st Jan there were 3 entries made, 2 entries on 2nd Jan and 1 entry on 3rd Jan. We can use a query directly in the source system to fetch all records for a particular date.
But what will happen if there were no records for a particular date?
Data Loading
For the 3rd Jan run, we would fetch all records with date > 2nd Jan, but what if the 2nd Jan run itself fails?
As per the above logic, we would lose the 2 records that were entered on 2nd Jan.
Thus, a separate table is created to store the run status; let us call this control table BATCH_RUN_STATUS. The table will look something like this:

BATCH RUN ID   DATE            STATUS
1              1st Jan 2000    S
2              2nd Jan 2000    F
3              3rd Jan 2000    D

The last column in the above table gives the status of each run.
The 1st Jan 2000 run was a success, hence S is entered; the 2nd Jan 2000 run failed, hence F is entered; the 3rd Jan batch is due to run, hence D is entered.
For the 3rd Jan batch, all records with a date greater than that of the previous SUCCESSFUL run are fetched. In the example above, all records dated after 1st Jan 2000 are fetched and loaded (a small sketch follows below).
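A minimal Python sketch of this control-table-driven fetch, using in-memory lists as stand-ins for BATCH_RUN_STATUS and the source table, might look like this:

    # Control table BATCH_RUN_STATUS: (batch_run_id, run_date, status) with S=success, F=failed, D=due
    control_table = [
        (1, "2000-01-01", "S"),
        (2, "2000-01-02", "F"),
        (3, "2000-01-03", "D"),
    ]

    # Source rows from the stationery shop: (customer_id, sale_date, product)
    source_rows = [
        (1, "2000-01-01", "A4 Sheets"),
        (1, "2000-01-01", "Pen"),
        (2, "2000-01-01", "Pencils"),
        (3, "2000-01-02", "Stapler"),
        (1, "2000-01-02", "Notepad"),
        (4, "2000-01-03", "Markers"),
    ]

    # Use the date of the last SUCCESSFUL run, not simply the previous run.
    last_success = max(run_date for _, run_date, status in control_table if status == "S")

    # Fetch everything newer than that date, so the rows from the failed 2nd Jan run are not lost.
    to_load = [row for row in source_rows if row[1] > last_success]
    print(to_load)   # rows from 2nd Jan and 3rd Jan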