ETL- Extraction,Transformation & Loading
What is ETL?
• ETL(Extraction, Transformation and Loading) is a process by which
data is integrated and transformed from the operational systems into
the Data warehouse environment
Filters and
Extractors
Cleanser
Error
Error
Operational systems Cleaning View
Rules Check
• Rule 1 Correct
• Rule 2
• Rule 3
Transformation
Rules
• Rule 1
• Rule 2
• Rule 3
Transformation
Engine
Integrator
Error
View
Check
Correct Loader Warehouse
1-2
Operational Data - Challenges
• Data from heterogeneous sources
• Format differences
• Data Variations
• Context
– Across locations the same code could
represent different customers
– Across periods of time a product code could
have been reused
1-3
Extraction
Data
from
80 tables tables 30
Filter
Oracle
Data from 10
tables Where
Date<10/12/99
50 tables
f iles
Sybase ta f rom
Da
Target
1-4 Text files
Transformation
Source
Emp id Last First
Name Name
10001 Jones Indiana
10002 Holmes Sherlock
Staging Area
Name =
Concat(First Name,
Last Name)
Indiana Jones
Sherlock Homes
1-5
Loading
Data
Source Warehouse
Direct Load
Staging
Area g rated
i nt e
rm ed &
ra n sfo load
l ean,T data
C
Cleaning,
Transformation
& Integration of
Raw data
1-6
Volume of ETL in a Data warehouse
Source OLTP
Systems Data Marts
Metadata
e re
is h
Enterprise
or k
Data Warehouse
e w
o f th
0 %
o 8
0 t
6
•Extract •Load
•Design •Extract •Load •Replication •Access & Analysis
•Design •Scrub •Index •Replication •Access & Analysis
•Mapping •Scrub •Index •Data Set Distribution •Resource Scheduling & Distributio
•Mapping •Transform •Aggregation •Data Set Distribution •Resource Scheduling & Distributio
•Transform •Aggregation
Meta
MetaData
Data
System
SystemMonitoring
Monitoring
1-7
Factors Influencing ETL Architecture
• Volume at each warehouse component.
• The time window available for extraction.
• The extraction type (Full,Periodic etc.)
• Complexity of the processes at each stage.
1-8
Extraction Types
Extraction Types
Extraction
Periodic/
Full Extract Incremental
Extract
1-10
Full Extract
Existing data
Data Mart
Full Extract
Source System
1-11
Full Extract
New data
Data Mart
Full Extract
Source System
1-12
Incremental Extract
Existing data
Incremental
Data
Data Mart
Incremental Extract
Source System
1-14
Incremental Extract
Existing data
New data
Incremental
Data
Data Mart
Incremental Extract
Source System
Changed data
1-15
Incremental Extract
New data Incremental addition
Incremental to data mart
Data
Data Mart
Source System
Incremental Extract
Changed data
Existing data updated
using changed data
1-16
Transformation
Data Transformation
• Conversions
– Data type (e.g. Char to Date)
– Bring data to common units (Currency,Measuring Units)
• Classifications
– Changing continuous values to discrete ranges (e.g.
Temperatures to Temperature Ranges)
• Splitting of fields
• Merging of fields
• Aggregations (e.g. Sum, Avg., Count)
• Derivations (Percentages, Ratios, Indicators)
1-18
Structural Transformations
• Additive
OLTP
Orders arrive
Data ware
every Aggregate house
two minutes
OLTP
Average
Daily
Productivity Average Data ware
figures house
1-19
Format transformation
Source Target
Schema Schema
Data Type Transformation
Conversions “32” 32
Age as a String Age as an
Integer
Source Target
Splitting Schema Schema
Transformation
“15-10-1992” 15 10 1999
Day Month Year
Date as a Date as a combination of 3
String integer fields
1-20
Simple Conversions
Source Target
Schema Schema
Multiply by 1/43
Rs. 10000 $232.56
Revenue in Revenue in
Rupees Dollars
Multiply by 0.4536
1000 lbs. 453.56 kgs.
Production in Production in
Pounds Kilograms
Source Target
Schema Schema
• Transformations using Simple Conversions
1-21
Classification
Name Age
John Black 27
Richard Wayne 53
Age GroupFrequency
Jennifer Goldman 45 20-25 1
Helmut Koch 37 26-30 4
Anna Ludwig 32
Shito Maketha 28 31-35 3
Tracy Withman 39 36-40 2
Ada Zhesky 25 Grouping
41-45 2
David Rosenberg 33
Pankaj Sharma 29 46-50 1
Zhu Ling 44 51-55 1
George Kurtz 27
Rita Hartman 34
56-60 0
1-22
Data Consistency Transformations
Source 1 Source 2 Source 1
Gender Gender Gender
Male – M Male – Male Male – 1
Female – F Female – Female Female – 2
Target
Gender
Male – M
Female – F
1-23
Reconciliation of Duplicated data
Joseph J.R.Smith Joe Smith
Smith 123 Maine 123 Maine
123 Maine St. St.
St. MA - MA -
MA - 70127 70127
70127
Joseph R Smith
123 Maine St.
MA - 70127
1-24
Data Aggregation - Design Requirements
• Aggregates must be stored in their own fact tables and each level
should have its own fact table
• Dimension tables attached to the aggregate fact tables should where
ever possible be shrunken versions of the dimension tables attached to
the base fact table
• The base fact table and all of its related aggregate fact tables must be
associated together as a family of schemas
1-25
Loading
Types of Data warehouse Loading
• Target update types
– Insert
– Update
1-27
Types of Data Warehouse Updates
Data Warehouse
Source data Data Staging
Point in Time Snapshots Insert
New Data Full Replace
Changed Data Selective Replace
Update
Update plus Retain
History
1-28
New Data and Point-In-Time Data Insert
Source data
New data
OR
Point-in-Time
Snapshot New Data Added to
(e.g.. Monthly) Existing Data
1-29
Changed Data Insert
Source data
Changed Data Added to
Existing Data
Changed
data
1-30
Change of Dimension values
When the value of dimension in a data warehouse changes,
then
History of change needs to be maintained.
Changed data alone needs to be identified
Changed data should be easier to access.
Reconstruction of the dimension table any point in time should be
easier
1-31
ETL - Approach in a nutshell
1) Identify the Operational systems based on data islands in the
target
2) Map source-target dependencies.
3) Define cleaning and transformation rules
4) Validate source-target mapping
5) Consolidate Meta data for ETL
6) Draw the ETL architecture
7) Build the cleaning, transformation and auditing routines using
either a tool or customized programs
1-32