
ETL Process for Data Warehousing

The document discusses Extract, Transform, Load (ETL) processes for data warehousing and business analytics. It covers extracting data from different source systems, transforming data to handle inconsistencies, and loading data into the data warehouse. It also discusses issues that can arise during ETL such as data quality problems and integrating data from diverse and legacy source systems.


Data Warehousing and Business Analytics

Lecture 3
Extract Transform Load (ETL)

1
Putting the pieces together

[Figure: Four-tier data warehouse architecture. Data sources (Tier 0): semistructured sources, www data, archived data, and operational databases. Data warehouse server (Tier 1): Extract, Transform, Load (ETL) into the warehouse, plus metadata and data marts. OLAP servers (Tier 2): MOLAP and ROLAP. Clients (Tier 3): query/reporting, analysis, and data mining tools for IT users and business users.]
2
The ETL Cycle
EXTRACT: The process of reading data from different sources.
TRANSFORM: The process of transforming the extracted data from its original state into a consistent state so that it can be placed into another database.
LOAD: The process of writing the data into the target system (the data warehouse).

[Figure: Sources such as MIS systems (Acct, HR), www data, legacy systems, archived data, and other indigenous applications (COBOL, VB, C++, Java) feed the EXTRACT step; data passes through temporary data storage for TRANSFORM/CLEANSE and is then LOADed into the data warehouse, which serves OLAP.]
3
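To make the cycle concrete, the following is a minimal sketch of one ETL pass in Python. The SQLite connections, the sales and sales_fact tables, and the sample values are invented for illustration; it only shows the extract/transform/load hand-off, not a production tool.

    # Minimal ETL sketch: extract from an operational source, transform into a
    # consistent state, load into the warehouse. Tables and values are invented.
    import sqlite3

    def extract(conn):
        # Read raw rows from the assumed operational 'sales' table.
        return conn.execute("SELECT id, name, gender, amount FROM sales").fetchall()

    def transform(rows):
        # Bring extracted rows into a consistent state: trim names, map codes.
        gender_map = {"0": "M", "1": "F", "M": "M", "F": "F"}
        return [(rid, name.strip().title(), gender_map.get(str(g), "U"), float(amount))
                for rid, name, g, amount in rows]

    def load(conn, rows):
        # Write the transformed rows into the assumed 'sales_fact' table.
        conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?, ?)", rows)
        conn.commit()

    if __name__ == "__main__":
        src = sqlite3.connect(":memory:")   # stands in for an operational source
        dwh = sqlite3.connect(":memory:")   # stands in for the data warehouse
        src.execute("CREATE TABLE sales (id, name, gender, amount)")
        src.execute("INSERT INTO sales VALUES (1, ' ali KHAN ', '0', '1500')")
        dwh.execute("CREATE TABLE sales_fact (id, name, gender, amount)")
        load(dwh, transform(extract(src)))
        print(dwh.execute("SELECT * FROM sales_fact").fetchall())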
ETL Processing
ETL consists of independent yet interrelated steps.
It is important to look at the big picture.
Data acquisition time may include…

Extracts from source systems → Data Movement → Data Cleansing → Data Transformation → Data Loading → Index Maintenance → Statistics Collection

Note: Backup comes as a further element after statistics collection.

Backup is a major task; it is a DWH, not a cube. 4


Overview of Data Extraction

The first step of ETL, followed by many others.

Source systems for extraction are typically OLTP systems.

A very complex task for a number of reasons:
 Very complex and poorly documented source systems.
 Data has to be extracted not once, but a number of times.

The process design depends on:
 Which extraction method to choose?
 How to make the extracted data available for further processing?
5
Types of Data Extraction

 Logical Extraction
 Full Extraction
 Incremental Extraction

 Physical Extraction
 Online Extraction
 Offline Extraction
 Legacy vs. OLTP
6
Logical Data Extraction
 Full Extraction
 Data is extracted completely from the source system.

 No need to keep track of changes.

 Source data is made available as-is, without needing any additional change-tracking information.

 Incremental Extraction
 Data is extracted after a well-defined point/event in time.

 A mechanism is used to reflect/record the temporal changes in the data (a column or a table); see the sketch below.

 Sometimes entire tables are off-loaded from the source system into the DWH.

 Can have significant performance impacts on the data warehouse server.

7
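A minimal sketch of incremental extraction using a persisted timestamp watermark; the orders table, its last_modified column, and the watermark file are assumptions for illustration.

    # Incremental extraction sketch: pull only rows changed since the last run,
    # using a persisted timestamp watermark (table/column names are assumed).
    import sqlite3

    WATERMARK_FILE = "last_extract.txt"   # hypothetical place to keep the watermark

    def read_watermark():
        try:
            return open(WATERMARK_FILE).read().strip()
        except FileNotFoundError:
            return "1970-01-01 00:00:00"   # first run degenerates to a full extraction

    def incremental_extract(conn):
        since = read_watermark()
        rows = conn.execute(
            "SELECT id, name, amount, last_modified FROM orders "
            "WHERE last_modified > ?", (since,)).fetchall()
        if rows:
            # Advance the watermark to the newest change seen in this batch.
            newest = max(r[3] for r in rows)
            with open(WATERMARK_FILE, "w") as f:
                f.write(str(newest))
        return rows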
Physical Data Extraction…
 Online Extraction
 Data extracted directly from the source system.
 May access source tables through an intermediate system.
 Intermediate system usually similar to the source system.

 Offline Extraction
 Data is NOT extracted directly from the source system; instead it is staged explicitly outside the original source system.
 Data is either already structured or was created by an extraction routine.
 Some of the prevalent structures are:
 Flat files
 Dump files
 Redo and archive logs
 Transportable tablespaces
8
Physical Data Extraction

 Legacy vs. OLTP

 Data moved from the source system

 Copy made of the source system data

 Staging area used for performance reasons

9
Data Transformation

 Basic tasks
1. Selection

2. Splitting/Joining

3. Conversion

4. Summarization

5. Enrichment
10
Aspects of Data Loading Strategies
 Need to look at:
 Data freshness
 System performance
 Data volatility

 Data Freshness
 Very fresh data → low update efficiency
 Historical data → high update efficiency
 Always trade-offs in the light of goals

 System performance
 Availability of staging table space
 Impact on query workload

 Data Volatility
 Ratio of new to historical data
 High percentages of data change (batch update)
11
Three Loading Strategies
• Once we have transformed the data, there are three primary loading strategies (sketched below):

• Full data refresh with BLOCK INSERT or ‘block slamming’ into an empty table.

• Incremental data refresh with BLOCK INSERT or ‘block slamming’ into existing (populated) tables.

• Trickle/continuous feed with constant data collection and loading using row-level insert and update operations.

12
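A hedged sketch of the three strategies: executemany stands in for a real block-load utility, the sales_fact table is hypothetical, and the trickle variant assumes a UNIQUE or PRIMARY KEY constraint on id.

    # Loading sketch: bulk 'block slamming' vs. row-level trickle feed.
    import sqlite3

    def full_refresh(dwh, rows):
        # Full data refresh: empty the table, then block-insert everything.
        dwh.execute("DELETE FROM sales_fact")
        dwh.executemany("INSERT INTO sales_fact VALUES (?, ?, ?)", rows)
        dwh.commit()

    def incremental_refresh(dwh, new_rows):
        # Incremental refresh: block-insert only new rows into the populated table.
        dwh.executemany("INSERT INTO sales_fact VALUES (?, ?, ?)", new_rows)
        dwh.commit()

    def trickle_feed(dwh, row):
        # Trickle/continuous feed: insert or update one row at a time as it arrives.
        dwh.execute(
            "INSERT INTO sales_fact VALUES (?, ?, ?) "
            "ON CONFLICT(id) DO UPDATE SET amount = excluded.amount", row)
        dwh.commit()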
ETL Issues
Data from different source systems will be different, poorly documented, and dirty. A lot of analysis is required.

Is it easy to collate addresses and names? Not really: there are no address or name standards.

Software can be used for standardization, but it is very expensive, as the “standards” vary from country to country and the market is not large enough.
13
ETL Issues
Things would have been simpler in the presence of operational systems, but that is not always the case.

Manual data collection and entry: nothing wrong with that, but it has the potential to introduce lots of problems.

Data is never perfect. The cost of perfection is extremely high compared to its value.

14
“Some” Issues
 Usually, if not always, underestimated
 Diversity in source systems and platforms
 Inconsistent data representations
 Complexity of transformations
 Rigidity and unavailability of legacy systems
 Volume of legacy data
 Web scraping

15
 Work seems to be deceptively simple.

 People start manually building the DWH.

 Programmers underestimate the task.

 Impressions could be deceiving.

 Traditional DBMS rules and concepts break down


for very large heterogeneous historical databases.

16
Platform         OS        DBMS        MIS/ERP
Main Frame       VMS       Oracle      SAP
Mini Computer    Unix      Informix    PeopleSoft
Desktop          Win NT    Access      JD Edwards
                 DOS       Text file

Dozens of source systems across organizations

Numerous source systems within an organization

Need specialist for each

17
Same data, different representation
Date value representations
Examples:
970314 1997-03-14
03/14/1997 14-MAR-1997
March 14 1997 2450521.5 (Julian date format)

Gender value representations


Examples:
- Male/Female - M/F
- 0/1 - PM/PF

18
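A small sketch of normalizing the date representations above to a single ISO format; the format list mirrors the slide's examples (the Julian value would need separate handling and is omitted).

    # Normalize several date representations to a single ISO format (YYYY-MM-DD).
    from datetime import datetime

    FORMATS = ["%y%m%d", "%Y-%m-%d", "%m/%d/%Y", "%d-%b-%Y", "%B %d %Y"]

    def normalize_date(value):
        for fmt in FORMATS:
            try:
                return datetime.strptime(value, fmt).date().isoformat()
            except ValueError:
                continue
        raise ValueError("Unrecognized date format: " + value)

    # All of these print '1997-03-14':
    for v in ["970314", "1997-03-14", "03/14/1997", "14-MAR-1997", "March 14 1997"]:
        print(normalize_date(v))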
Need to rank source systems on a per data element basis.

Take the data element from the source system with the highest rank where the element exists.

“Guessing” gender from name

Something is better than nothing?

Must sometimes establish “group ranking” rules to maintain


data integrity.

First, middle and family name from two systems of different


rank. People using middle name as first name.
19
Simple one-to-one scalar transformations
- 0/1 → M/F

One-to-many element transformations
- 4 x 20 address field → House/Flat, Road/Street, Area/Sector, City.

Many-to-many element transformations
- House-holding (who live together) and individualization (who are the same) and same lands.

20
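A sketch of the first two cases: a one-to-one scalar map for gender codes and a one-to-many split of a single address field into four parts. The comma-split rule is a deliberate simplification; real address parsing is much messier.

    # One-to-one scalar transformation: map source gender codes onto M/F.
    GENDER_CODES = {"0": "M", "1": "F", "PM": "M", "PF": "F",
                    "Male": "M", "Female": "F"}

    def to_gender(code):
        return GENDER_CODES.get(str(code).strip(), "U")   # 'U' = unknown

    # One-to-many transformation: split a single 4 x 20 address field into parts.
    def split_address(address):
        parts = [p.strip() for p in address.split(",")]
        parts += [""] * (4 - len(parts))                   # pad to four components
        return dict(zip(["house_flat", "road_street", "area_sector", "city"],
                        parts[:4]))

    print(to_gender("PM"))                                 # -> M
    print(split_address("House 12, Street 5, Sector F-8, Islamabad"))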
 Very difficult to add logic to or increase performance of
legacy systems.

 Utilization of expensive legacy systems is optimized.

 Therefore, want to off-load transformation cycles to


open systems environment.

 This often requires new skill sets.

 Need efficient and easy way to deal with incompatible


mainframe data formats.

21
 We are talking not about weekly data, but about data spread over years.

 Historical data on tapes that are serial and very


slow to mount etc.

 Need lots of processing and I/O to effectively handle


large data volumes.

 Need efficient interconnect bandwidth to transfer


large amounts of data from legacy sources to DWH.

22
▣ There is a lot of data in a web page, but it is mixed with a lot of “junk”.

▣ Problems:
◾ Limited query interfaces
 Fill in forms
◾ “Free text” fields
 E.g. addresses
◾ Inconsistent output
 i.e., html tags which mark interesting fields might be
different on different pages.
◾ Rapid change without notice.

23
Beware of data quality
 Data quality is always worse than expected.

 Will have a couple of lectures on data quality and


its management.

 It is not a matter of a few hundred rows.

 Data recorded for running operations is not usually


good enough for decision support.
 Correct totals don’t guarantee data quality.
 Not knowing gender does not hurt POS.
 Centurion customers popping up.
24
ETL vs. ELT
There are two fundamental approaches to data
acquisition:

ETL: Extract, Transform, Load in which data


transformation takes place on a separate
transformation server.

ELT: Extract, Load, Transform in which data


transformation takes place on the data warehouse
server.

A combination of both is also possible (a minimal ELT sketch follows below).


25
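A minimal ELT illustration: raw rows are loaded into a staging table first, and the transformation then runs as SQL inside the warehouse engine. The staging and dimension tables are assumed for the example.

    # ELT sketch: load raw data first, then transform inside the warehouse engine.
    import sqlite3

    dwh = sqlite3.connect(":memory:")   # stands in for the warehouse server

    # 1. Extract + Load: copy raw rows into a staging table unchanged.
    dwh.execute("CREATE TABLE stg_customer (id TEXT, name TEXT, gender TEXT)")
    dwh.executemany("INSERT INTO stg_customer VALUES (?, ?, ?)",
                    [("1", " alice ", "0"), ("2", "BOB", "1")])

    # 2. Transform: the warehouse does the work with set-based SQL.
    dwh.execute("CREATE TABLE dim_customer (id INTEGER, name TEXT, gender TEXT)")
    dwh.execute("""
        INSERT INTO dim_customer
        SELECT CAST(id AS INTEGER),
               UPPER(TRIM(name)),
               CASE gender WHEN '0' THEN 'M' WHEN '1' THEN 'F' ELSE 'U' END
        FROM stg_customer""")
    dwh.commit()
    print(dwh.execute("SELECT * FROM dim_customer").fetchall())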
Extracting Changed Data
Incremental data extraction
Incremental data extraction, i.e., extracting what has changed, say during the last 24 hours if considering a nightly extraction.

Efficient when changes can be identified
This is efficient when the small set of changed data can be identified efficiently.

Identification could be costly
Unfortunately, for many source systems, identifying the recently modified data may be difficult or may affect the operation of the source system.

Very challenging
Change Data Capture (CDC) is therefore typically the most challenging technical issue in data extraction.

26
Source Systems

Two CDC sources


• Modern systems
• Legacy systems

27
CDC in Modern Systems
• Time Stamps
• Works if timestamp column present
• If column not present, add column
• May not be possible to modify table, so add triggers

• Triggers
• Create a trigger for each source table
• Following each DML operation, the trigger performs updates
• Record DML operations in a log

• Partitioning
• Table range partitioned, say along date key
• Easy to identify new data, say last week’s data

28
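A sketch of the timestamp variant of CDC: a trigger keeps a last_modified column current on an assumed orders table, and the nightly extract selects only rows changed since the previous run. It relies on SQLite's default of non-recursive triggers; all names are illustrative.

    # CDC sketch for a modern system: a trigger keeps last_modified current and
    # the nightly extract selects only rows changed since the previous run.
    import sqlite3

    src = sqlite3.connect(":memory:")   # stands in for the operational source
    src.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL,
                         last_modified TIMESTAMP DEFAULT CURRENT_TIMESTAMP);

    CREATE TRIGGER orders_touch AFTER UPDATE ON orders
    BEGIN
        UPDATE orders SET last_modified = CURRENT_TIMESTAMP WHERE id = NEW.id;
    END;
    """)

    def changed_since(conn, since):
        # The extract only asks for rows modified after the previous watermark.
        return conn.execute(
            "SELECT id, amount, last_modified FROM orders WHERE last_modified > ?",
            (since,)).fetchall()

    src.execute("INSERT INTO orders (amount) VALUES (100.0)")
    src.execute("UPDATE orders SET amount = 120.0 WHERE id = 1")
    print(changed_since(src, "2005-11-13 00:00:00"))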
CDC in Legacy Systems
• Changes recorded on tapes: changes that occur in legacy transaction processing are recorded on the log or journal tapes.

• Changes read and stripped from tapes: the log or journal tape is read and the update/transaction changes are stripped off for movement into the data warehouse.

• Problems with reading a log/journal tape are many:


• Contains a lot of extraneous data
• Format is often arcane
• Often contains addresses instead of data values and keys
• Sequencing of data in the log tape often has deep and complex implications
• Log tape varies widely from one DBMS to another.

29
CDC Advantages: Modern Systems

Advantages (modern systems):

1. Immediate
2. No loss of history
3. Flat files NOT required

30
CDC Advantages: Legacy Systems

Advantages (legacy systems):

1. No incremental on-line I/O required for the log tape
2. The log tape captures all update processing
3. Log tape processing can be taken off-line
4. No haste to make waste

31
Major Transformation Types

 Format revision
 Decoding of fields
 Calculated and derived values
 Splitting of single fields
 Merging of information
 Character set conversion
 Unit of measurement conversion
 Date/Time conversion
 Summarization
 Key restructuring
 Duplication
32
Major Transformation Types

 Format revision
 Decoding of fields (covered in De-Norm)
 Calculated and derived values (covered in issues)
 Splitting of single fields

33
Major Transformation Types

 Merging of information
It does not really mean combining columns to create one column; rather, information for a product coming from different sources is merged into a single entity.

 Character set conversion
For the PC architecture, converting legacy EBCDIC to ASCII.

 Unit of measurement conversion
For companies with global branches: km vs. mile, or lb vs. kg.

 Date/Time conversion
November 14, 2005 is written as 11/14/2005 in the US format and 14/11/2005 in the British format. This date may be standardized to be written as 14 NOV 2005.

34
Major Transformation Types

 Aggregation & Summarization

 How are they different?
Summarization is adding like values; summarization with a calculation across a business dimension is aggregation. Example: monthly compensation = monthly sales + bonus (sketched below).

 Why are both required?
 Grain mismatch (don't require the detail, don't have the space)
 Data marts requiring low detail
 Detail losing its utility
35
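A small sketch of the distinction using pandas; the column names and figures are invented for illustration.

    # Summarization vs. aggregation sketch using pandas (data is invented).
    import pandas as pd

    sales = pd.DataFrame({
        "salesperson": ["A", "A", "B", "B"],
        "month":       ["2005-11", "2005-11", "2005-11", "2005-12"],
        "sale":        [100.0, 250.0, 300.0, 150.0],
        "bonus":       [10.0, 10.0, 30.0, 0.0],
    })

    # Summarization: add like values along the grain (monthly sales per person).
    monthly = sales.groupby(["salesperson", "month"], as_index=False)[["sale", "bonus"]].sum()

    # Aggregation: a calculation across the business dimension on top of that.
    monthly["compensation"] = monthly["sale"] + monthly["bonus"]
    print(monthly)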
Major Transformation Types
 Key restructuring (keys have inherent meaning at the source)
   92 = Country_Code, 42 = City_Code, 4979 = Post_Code, 234 = Product_Code
 i.e. 92424979234 changed to 12345678

 Removing duplication
Incorrect or missing value
Inconsistent naming convention ONE vs 1
Incomplete information
Physically moved, but address not changed
Misspelling or falsification of names 36
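A sketch of key restructuring: the smart source key 92424979234 is replaced by a meaning-free surrogate; the in-memory dictionary stands in for a real key-lookup table in the warehouse.

    # Key restructuring sketch: replace smart production keys with surrogate keys.
    from itertools import count

    _surrogates = {}              # stands in for a key-lookup table in the warehouse
    _next_key = count(start=1)

    def surrogate_key(source_key):
        # Hand out a new, meaning-free integer the first time a source key is seen.
        if source_key not in _surrogates:
            _surrogates[source_key] = next(_next_key)
        return _surrogates[source_key]

    print(surrogate_key("92424979234"))   # smart key (country/city/post/product) -> 1
    print(surrogate_key("92424979234"))   # the same source key maps to the same surrogate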
Data content defects

 Domain value redundancy
 Non-standard data formats
 Non-atomic data values
 Multipurpose data fields
 Embedded meanings
 Inconsistent data values
 Data quality contamination

37
Data content defects Examples

 Domain value redundancy
 Unit of Measure: Dozen, Doz., Dz., 12

 Non-standard data formats
 Phone Numbers: 1234567 or 123.456.7

 Non-atomic data fields
 Name & Addresses: Dr. Nguyen Van A, PhD
38
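A sketch of collapsing the redundant unit-of-measure values above onto one standard code; the mapping table itself is an assumption.

    # Standardize redundant domain values (the unit-of-measure example above).
    UOM_STANDARD = {"dozen": "DZ", "doz.": "DZ", "doz": "DZ",
                    "dz.": "DZ", "dz": "DZ", "12": "DZ"}

    def standardize_uom(value):
        key = str(value).strip().lower()
        return UOM_STANDARD.get(key, key.upper())   # pass unknown units through

    for v in ["Dozen", "Doz.", "Dz.", 12]:
        print(standardize_uom(v))                   # every line prints 'DZ'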
Data content defects Examples

 Embedded Meanings
 RC, AP, RJ
 received, approved, rejected

39
Background
• Other names: also called data scrubbing or data cleaning.
• More than data arranging: a DWH is NOT just about arranging data; the data should be clean for the overall health of the organization. We drink clean water!
• Big problem, big effect: an enormous problem, as most data is dirty. GIGO.
• Dirty is relative: dirty means the data does not conform to the proper domain definition, and this varies from domain to domain.
• Paradox: must involve a domain expert, as detailed domain knowledge is required, so it becomes semi-automatic; but it has to be automatic because of the large data sets.
• Data duplication: the original problem was removing duplicates in one system, compounded by duplicates from many systems.

40
Lighter Side of Dirty Data
• Year of birth 1995, current year 2005.

• Born in 1986 hired in 1985

• Who would take it seriously? Computers while


summarizing, aggregating, populating etc.

• Small discrepancies become irrelevant for large


averages, but what about sums, medians, maximum,
minimum etc.?
41
Serious Side of dirty data
• Decision making at the government level on investment in schools and then teachers, based on the birth rate. Wrong data results in over- and under-investment.

• Direct mail marketing: letters sent to wrong addresses are returned, or multiple letters go to the same address, resulting in loss of money, a bad reputation, and wrong identification of the marketing region.

42
3 Classes of Anomalies…
▣ Syntactically Dirty Data
◾Lexical Errors
◾Irregularities

▣ Semantically Dirty Data


◾Integrity Constraint Violation
◾Business rule contradiction
◾Duplication

▣ Coverage Anomalies
◾Missing Attributes
◾Missing Records

43
3 Classes of Anomalies…
▣ Syntactically Dirty Data
◾Lexical Errors
◾Discrepancies between the structure of the data items
and the specified format of stored values
◾e.g. number of columns used are unexpected for a tuple
(mixed up number of attributes)
◾Irregularities
◾Non-uniform use of units and values, such as giving an annual salary without specifying whether it is in US$ or PK Rs.

▣ Semantically Dirty Data


◾Integrity Constraint violation
◾Contradiction
DoB > Hiring date etc.
◾Duplication
44
3 Classes of Anomalies…

•Coverage Anomalies
• Missing Attribute
 Result of omissions while collecting the data.
 A constraint violation if we have null values for
attributes where NOT NULL constraint exists.
 Case more complicated where no such
constraint exists.
 Have to decide whether the value exists in the real
world and has to be deduced here or not.

45
Coverage Anomalies

• Equipment malfunction (bar code reader, keyboard


etc.)
• Inconsistent with other recorded data and thus
deleted.
• Data not entered due to misunderstanding/illegibility.
• Data not considered important at the time of entry
(e.g. Y2K).

46
Handling missing data

• Dropping records.
• “Manually” filling missing values.
• Using a global constant as filler.
• Using the attribute mean (or median) as filler.
• Using the most probable value as filler.

47
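A sketch of the last three filling options using pandas: a global constant, the attribute mean, and the most probable (modal) value; the columns and values are invented.

    # Filling missing values: global constant, attribute mean, most probable value.
    import pandas as pd

    df = pd.DataFrame({
        "gender": ["M", None, "F", "M", None],
        "age":    [34, None, 29, None, 41],
    })

    df["gender_filled"] = df["gender"].fillna("U")                     # global constant
    df["age_filled"]    = df["age"].fillna(df["age"].mean())           # attribute mean
    df["gender_mode"]   = df["gender"].fillna(df["gender"].mode()[0])  # most probable value
    print(df)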
Key Based Classification of Problems

• Primary key problems

• Non-Primary key problems

48
Primary key problems

◾ Same PK but different data.

◾ Same entity with different keys.

◾ PK in one system but not in other.

◾ Same PK but in different formats.

49
Non primary key problems…

• Different encoding in different sources.


• Multiple ways to represent the same
information.
• Sources might contain invalid data.
• Two fields with different data but same name.

50
Non primary key problems

• Required fields left blank.

• Data erroneous or incomplete.

• Data contains null values.

51
Automatic Data Cleansing…

1. Statistical
2. Pattern Based
3. Clustering
4. Association Rules
52
