Data Warehousing
and Business Analytics
Lecture-3
Extract Transform Load (ETL)
1
Putting the pieces together
[Figure: Data warehouse architecture. Tier 0 – data sources (semistructured sources, www data, archived data, operational databases); Tier 1 – data warehouse server (Extract, Transform, Load (ETL) into the warehouse and its metadata, with data marts, run by IT users); Tier 2 – OLAP servers (MOLAP and ROLAP); Tier 3 – clients (query/reporting, analysis and data mining tools for business users). Everything except the ETL stage is shown washed out, to highlight ETL.]
2
The ETL Cycle
EXTRACT: the process of reading data from different sources.
TRANSFORM: the process of transforming the extracted data from its original state into a consistent state so that it can be placed into another database.
LOAD: the process of writing the data into the target source.
[Figure: The ETL cycle. Data is EXTRACTed from MIS systems (Acct, HR), www data, legacy systems, archived data and other in-house applications (COBOL, VB, C++, Java) into temporary data storage, TRANSFORMed and CLEANSEd there, then LOADed into the data warehouse for OLAP.]
3
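To make the three definitions above concrete, here is a minimal sketch of the cycle in Python, assuming a hypothetical CSV source file (sales.csv), illustrative column names, and SQLite standing in for the warehouse; none of these names come from the lecture.

```python
import csv
import sqlite3

def extract(path):
    """Read rows from a delimited source file (hypothetical sales.csv)."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    """Bring the extracted rows into a consistent state (illustrative rules only)."""
    gender_map = {"male": "M", "female": "F", "m": "M", "f": "F"}
    cleaned = []
    for r in rows:
        cleaned.append({
            "customer": r["customer"].strip().title(),
            "gender": gender_map.get(r["gender"].strip().lower(), "U"),
            "amount": float(r["amount"]),
        })
    return cleaned

def load(rows, conn):
    """Write the transformed rows into the target warehouse table."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales (customer TEXT, gender TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (:customer, :gender, :amount)", rows)
    conn.commit()

if __name__ == "__main__":
    warehouse = sqlite3.connect("warehouse.db")
    load(transform(extract("sales.csv")), warehouse)
```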
ETL Processing
ETL consists of independent yet interrelated steps.
It is important to look at the big picture.
Data acquisition time may include:
Extracts from source systems → Data Movement → Data Cleansing → Data Transformation → Data Loading → Index Maintenance → Statistics Collection
Note: Backup comes as a further element after statistics collection.
Backup is a major task; it is a DWH, not a cube.
4
Overview of Data Extraction
The first step of ETL, followed by many others.
Source systems for extraction are typically OLTP
systems.
A very complex task, for a number of reasons:
Source systems are often very complex and poorly documented.
Data has to be extracted not once, but a number of times.
The process design depends on:
Which extraction method to choose?
How to make the extracted data available for further
processing?
5
Types of Data Extraction
Logical Extraction
Full Extraction
Incremental Extraction
Physical Extraction
Online Extraction
Offline Extraction
Legacy vs. OLTP
6
Logical Data Extraction
Full Extraction
The data is extracted completely from the source system.
No need to keep track of changes.
Source data is made available as-is, without needing any additional information.
Incremental Extraction
Data is extracted after a well-defined point/event in time.
A mechanism is used to reflect/record the temporal changes in the data (a column
or a table).
Sometimes entire tables are off-loaded from the source system into the DWH.
Can have significant performance impacts on the data warehouse
server.
7
Physical Data Extraction…
Online Extraction
Data is extracted directly from the source system.
May access source tables through an intermediate system.
The intermediate system is usually similar to the source system.
Offline Extraction
Data is NOT extracted directly from the source system; instead it is staged
explicitly outside the original source system.
The data is either already structured or was created by an extraction
routine.
Some of the prevalent structures are:
Flat files
Dump files
Redo and archive logs
Transportable tablespaces
8
Physical Data Extraction
Legacy vs. OLTP
Data moved from the source system
Copy made of the source system data
Staging area used for performance reasons
9
Data Transformation
Basic tasks
1. Selection
2. Splitting/Joining
3. Conversion
4. Summarization
5. Enrichment
10
Aspects of Data Loading Strategies
Need to look at:
Data freshness
System performance
Data volatility
Data Freshness
Very fresh data → low update efficiency
Historical data → high update efficiency
Always trade-offs in the light of the goals
System performance
Availability of staging table space
Impact on query workload
Data Volatility
Ratio of new to historical data
High percentages of data change (batch update)
11
Three Loading Strategies
• Once we have transformed the data, there are three primary loading strategies:
• Full data refresh with BLOCK INSERT or ‘block slamming’ into an empty table.
• Incremental data refresh with BLOCK INSERT or ‘block slamming’ into existing (populated) tables.
• Trickle/continuous feed with constant data collection and loading, using row-level insert and update operations (see the sketch below).
12
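A minimal sketch of the three strategies, with SQLite standing in for the warehouse: ‘block slamming’ is approximated by a bulk executemany, and the trickle feed by a row-level upsert (the ON CONFLICT syntax needs SQLite 3.24 or later). Table and column names are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, amount REAL)")

def full_refresh(rows):
    """Full refresh: empty the target, then bulk-insert the whole extract."""
    conn.execute("DELETE FROM sales")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    conn.commit()

def incremental_refresh(new_rows):
    """Incremental refresh: bulk-insert only the new rows into the populated table."""
    conn.executemany("INSERT INTO sales VALUES (?, ?)", new_rows)
    conn.commit()

def trickle(row):
    """Trickle/continuous feed: row-level insert or update as each change arrives."""
    conn.execute(
        "INSERT INTO sales VALUES (?, ?) "
        "ON CONFLICT(id) DO UPDATE SET amount = excluded.amount",
        row,
    )
    conn.commit()

full_refresh([(1, 100.0), (2, 250.0)])
incremental_refresh([(3, 75.0)])
trickle((2, 260.0))  # updates row 2 in place
```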
ETL Issues
Data from different source systems will be different, poorly documented and dirty. A lot of analysis is required.
Easy to collate addresses and names? Not really; there are no address or name standards.
Use software for standardization? Very expensive, as any “standards” vary from country to country, and no single market is large enough.
13
ETL Issues
Things would have been simpler in the presence of operational systems, but that is not always the case.
Manual data collection and entry: nothing wrong with that, but it has the potential to introduce lots of problems.
Data is never perfect, and the cost of perfection is extremely high compared to its value.
14
“Some” Issues
Usually, if not always, underestimated
Diversity in source systems and platforms
Inconsistent data representations
Complexity of transformations
Rigidity and unavailability of legacy systems
Volume of legacy data
Web scraping
15
Work seems to be deceptively simple.
People start manually building the DWH.
Programmers underestimate the task.
Impressions could be deceiving.
Traditional DBMS rules and concepts break down
for very large heterogeneous historical databases.
16
Platform        OS       DBMS        MIS/ERP
Main Frame      VMS      Oracle      SAP
Mini Computer   Unix     Informix    PeopleSoft
Desktop         Win NT   Access      JD Edwards
                DOS      Text file
Dozens of source systems across organizations
Numerous source systems within an organization
Need specialist for each
17
Same data, different representation
Date value representations
Examples: 970314, 1997-03-14, 03/14/1997, 14-MAR-1997, March 14 1997, 2450521.5 (Julian date format)
Gender value representations
Examples: Male/Female, M/F, 0/1, PM/PF
18
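A small sketch of how such representations might be standardized, assuming ISO 8601 as the target date format and M/F as the target gender codes; the 0→M / 1→F mapping and the list of source formats are assumptions, and the Julian form is not handled.

```python
from datetime import datetime

DATE_FORMATS = ["%y%m%d", "%Y-%m-%d", "%m/%d/%Y", "%d-%b-%Y", "%B %d %Y"]

def standardize_date(value):
    """Try each known source format and emit ISO 8601 (YYYY-MM-DD)."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date representation: {value!r}")

# Assumed mapping; which code means which gender must come from the source's documentation.
GENDER_MAP = {"male": "M", "female": "F", "m": "M", "f": "F",
              "0": "M", "1": "F", "pm": "M", "pf": "F"}

def standardize_gender(value):
    return GENDER_MAP.get(str(value).strip().lower(), "U")  # 'U' = unknown

print(standardize_date("970314"))       # 1997-03-14
print(standardize_date("14-MAR-1997"))  # 1997-03-14
print(standardize_gender("PF"))         # F
```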
Need to rank source systems on a per-data-element basis.
Take the data element from the source system with the highest rank in which the element exists.
“Guessing” gender from the name: something is better than nothing?
Must sometimes establish “group ranking” rules to maintain data integrity.
Example: first, middle and family names coming from two systems of different rank, with people using their middle name as their first name. (A ranking sketch follows below.)
19
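A sketch of per-element source ranking; the source names, ranks, records and field names below are all hypothetical.

```python
# Lower number = higher-ranked (more trusted) source for a given data element.
SOURCE_RANK = {"billing": 1, "crm": 2, "web_signup": 3}

def pick_element(element, records):
    """records: {source_name: record_dict}. Return the value of `element`
    from the highest-ranked source in which it actually exists."""
    for source in sorted(records, key=lambda s: SOURCE_RANK.get(s, 99)):
        value = records[source].get(element)
        if value not in (None, ""):
            return value, source
    return None, None

records = {
    "crm":        {"first_name": "Ali", "gender": ""},
    "billing":    {"first_name": "",    "gender": "M"},
    "web_signup": {"first_name": "A.",  "gender": "M"},
}
print(pick_element("first_name", records))  # ('Ali', 'crm')
print(pick_element("gender", records))      # ('M', 'billing')
```

A group-ranking rule would instead take all name parts (first, middle, family) from the same winning source, so that a middle name used as a first name in one system is not mixed with parts from another.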
Simple one-to-one scalar transformations
- 0/1 → M/F
One-to-many element transformations
- 4 x 20 address field → House/Flat, Road/Street, Area/Sector, City
Many-to-many element transformations
- House-holding (who live together) and individualization (who are the same person), and likewise for land records. (See the sketch below.)
20
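A sketch of the first two shapes, assuming the legacy 4 x 20 address arrives as a list of four lines and that the field order (house/flat, road/street, area/sector, city) holds; house-holding and individualization matching are much harder and are not attempted here.

```python
# One-to-one scalar transformation: 0/1 -> M/F (assumed mapping).
CODE_TO_GENDER = {"0": "M", "1": "F"}

def split_address(addr_lines):
    """One-to-many: split a legacy 4 x 20-character address block into separate columns."""
    lines = [line.strip() for line in addr_lines] + [""] * 4  # pad short blocks
    return {
        "house_flat":  lines[0][:20],
        "road_street": lines[1][:20],
        "area_sector": lines[2][:20],
        "city":        lines[3][:20],
    }

legacy = ["House 12, Flat 3", "Main Boulevard", "Sector F-7", "Islamabad"]
print(CODE_TO_GENDER["1"])
print(split_address(legacy))
```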
Very difficult to add logic to, or increase the performance of, legacy systems.
Utilization of expensive legacy systems is kept optimized.
Therefore, we want to off-load transformation cycles to an open-systems environment.
This often requires new skill sets.
Need an efficient and easy way to deal with incompatible mainframe data formats.
21
We are talking not about weekly data, but about data spread over years.
Historical data on tapes that are serial and very
slow to mount etc.
Need lots of processing and I/O to effectively handle
large data volumes.
Need efficient interconnect bandwidth to transfer
large amounts of data from legacy sources to DWH.
22
▣ A lot of data is in a web page, but it is mixed with a lot of “junk”.
▣ Problems:
◾ Limited query interfaces
Fill in forms
◾ “Free text” fields
E.g. addresses
◾ Inconsistent output
i.e., html tags which mark interesting fields might be
different on different pages.
◾ Rapid change without notice.
23
Beware of data quality
Data quality is always worse than expected.
We will have a couple of lectures on data quality and its management.
It is not a matter of a few hundred rows.
Data recorded for running operations is not usually good enough for decision support.
Correct totals don’t guarantee data quality.
Not knowing gender does not hurt at the POS.
“Centenarian” customers popping up.
24
ETL vs. ELT
There are two fundamental approaches to data
acquisition:
ETL: Extract, Transform, Load in which data
transformation takes place on a separate
transformation server.
ELT: Extract, Load, Transform in which data
transformation takes place on the data warehouse
server.
A combination of both is also possible.
25
Extracting Changed Data
Incremental data extraction
i.e. what has changed, say during the last 24 hrs if considering a nightly extraction.
Efficient when changes can be identified
This is efficient when the small set of changed data can be identified efficiently.
Identification could be costly
Unfortunately, for many source systems, identifying the recently modified data may be difficult, or may affect the operation of the source system.
Very challenging
Change Data Capture (CDC) is therefore typically the most challenging technical issue in data extraction.
26
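A sketch of nightly incremental extraction, assuming the source table carries a last_modified timestamp column maintained by the source system; an in-memory SQLite table stands in for the OLTP source, and all names are illustrative.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

# Stand-in for the OLTP source system's "orders" table.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (order_id INTEGER, amount REAL, last_modified TEXT)")
source.execute("INSERT INTO orders VALUES (1, 99.0, datetime('now'))")
source.execute("INSERT INTO orders VALUES (2, 10.0, datetime('now', '-3 days'))")

def extract_changes(hours=24):
    """Pull only the rows touched in the last `hours` (what changed since the last run)."""
    cutoff = (datetime.now(timezone.utc) - timedelta(hours=hours)).strftime("%Y-%m-%d %H:%M:%S")
    cur = source.execute(
        "SELECT order_id, amount, last_modified FROM orders WHERE last_modified >= ?",
        (cutoff,),
    )
    return cur.fetchall()

print(extract_changes())  # only order 1 falls inside the 24-hour window
```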
Source Systems
Two CDC sources
• Modern systems
• Legacy systems
27
CDC in Modern Systems
• Time Stamps
• Works if a timestamp column is present
• If the column is not present, add a column
• If it is not possible to modify the table, add triggers instead
• Triggers
• Create a trigger for each source table
• Following each DML operation, the trigger performs updates
• Record DML operations in a log
• Partitioning
• Table is range-partitioned, say along a date key
• Easy to identify new data, say last week’s data
28
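A sketch of the trigger approach: one trigger per DML operation records the change in a log table that the ETL later reads. SQLite stands in for the source DBMS, and the table, trigger and column names are illustrative.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, city TEXT);

-- Change-log table populated by the triggers after each DML operation.
CREATE TABLE customer_changes (
    id INTEGER, op TEXT, changed_at TEXT DEFAULT (datetime('now'))
);

CREATE TRIGGER customer_ins AFTER INSERT ON customer
BEGIN INSERT INTO customer_changes (id, op) VALUES (NEW.id, 'I'); END;

CREATE TRIGGER customer_upd AFTER UPDATE ON customer
BEGIN INSERT INTO customer_changes (id, op) VALUES (NEW.id, 'U'); END;

CREATE TRIGGER customer_del AFTER DELETE ON customer
BEGIN INSERT INTO customer_changes (id, op) VALUES (OLD.id, 'D'); END;
""")

db.execute("INSERT INTO customer VALUES (1, 'Ayesha', 'Lahore')")
db.execute("UPDATE customer SET city = 'Karachi' WHERE id = 1")

# The nightly ETL reads (and then clears) the change log instead of scanning the table.
print(db.execute("SELECT id, op, changed_at FROM customer_changes").fetchall())
```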
CDC in Legacy Systems
• Changes recorded on tapes: changes occurring in legacy transaction processing are recorded on the log or journal tapes.
• Changes read and stripped from tapes: the log or journal tape is read, and the update/transaction changes are stripped off for movement into the data warehouse.
• Problems with reading a log/journal tape are many:
• It contains a lot of extraneous data
• The format is often arcane
• It often contains addresses instead of data values and keys
• The sequencing of data on the log tape often has deep and complex implications
• The log tape format varies widely from one DBMS to another
29
CDC Advantages: Modern Systems
Advantages (modern systems):
1. Immediate
2. No loss of history
3. Flat files NOT required
30
CDC Advantages: Legacy Systems
Advantages (legacy systems):
1. No incremental on-line I/O required for the log tape
2. The log tape captures all update processing
3. Log tape processing can be taken off-line
4. No haste to make waste
31
Major Transformation Types
Format revision
Decoding of fields
Calculated and derived values
Splitting of single fields
Merging of information
Character set conversion
Unit of measurement conversion
Date/Time conversion
Summarization
Key restructuring
Duplication
32
Major Transformation Types
Format revision
Decoding of fields
Calculated and derived values
Splitting of single fields
(Slide callouts: “Covered in De-Norm”, “Covered in issues”)
33
Major Transformation Types
Merging of information
Does not really mean combining columns to create one column; rather, info about a product coming from different sources is merged into a single entity.
Character set conversion
For PC architectures, converting legacy EBCDIC to ASCII.
Unit of measurement conversion
For companies with global branches: km vs. mile, or lb vs. kg.
Date/Time conversion
November 14, 2005 is written as 11/14/2005 in the US and 14/11/2005 in the British format. This date may be standardized to be written as 14 NOV 2005. (See the sketch below.)
34
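A sketch of three of these conversions, assuming EBCDIC code page cp037 for the mainframe data and the usual km/mile and lb/kg factors; the values are illustrative.

```python
from datetime import datetime

# Character set conversion: legacy EBCDIC (code page cp037 assumed) to Unicode/ASCII.
ebcdic_bytes = "CUSTOMER 42".encode("cp037")   # stand-in for a mainframe field
print(ebcdic_bytes.decode("cp037"))            # CUSTOMER 42

# Unit of measurement conversion for globally sourced data.
KM_PER_MILE = 1.609344
LB_PER_KG = 2.20462
print(100 * KM_PER_MILE)   # 100 miles in km
print(50 / LB_PER_KG)      # 50 lb in kg

# Date/Time conversion: US and British forms standardized to DD MON YYYY.
def standardize(value, source_format):
    return datetime.strptime(value, source_format).strftime("%d %b %Y").upper()

print(standardize("11/14/2005", "%m/%d/%Y"))  # 14 NOV 2005
print(standardize("14/11/2005", "%d/%m/%Y"))  # 14 NOV 2005
```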
Major Transformation Types
Aggregation & Summarization
How are they different? Summarization is adding like values; summarization with a calculation across a business dimension is aggregation. Example: monthly compensation = monthly sale + bonus.
Why are both required?
Grain mismatch (don’t require the detail, don’t have the space)
Data marts requiring low detail
Detail losing its utility
(A sketch follows below.)
35
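A sketch of the distinction in plain Python over hypothetical sales rows; the compensation formula follows the slide’s example.

```python
from collections import defaultdict

sales = [
    {"month": "2005-01", "store": "A", "sale": 1000, "bonus": 50},
    {"month": "2005-01", "store": "B", "sale": 1500, "bonus": 70},
    {"month": "2005-02", "store": "A", "sale": 900,  "bonus": 40},
]

# Summarization: adding like values (sales) across the store dimension, per month.
monthly_sales = defaultdict(int)
for row in sales:
    monthly_sales[row["month"]] += row["sale"]
print(dict(monthly_sales))  # {'2005-01': 2500, '2005-02': 900}

# Aggregation: a calculated value across a business dimension,
# e.g. monthly compensation = monthly sale + bonus.
monthly_comp = defaultdict(int)
for row in sales:
    monthly_comp[row["month"]] += row["sale"] + row["bonus"]
print(dict(monthly_comp))   # {'2005-01': 2620, '2005-02': 940}
```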
Major Transformation Types
Key restructuring (keys carry inherent meaning at the source)
92 | 42 | 4979 | 234
Country_Code | City_Code | Post_Code | Product_Code
i.e. 92424979234 is changed to a generated key such as 12345678
Removing duplication, caused for example by:
Incorrect or missing values
Inconsistent naming conventions: ONE vs. 1
Incomplete information
Physically moved, but address not changed
Misspelling or falsification of names
36
Data content defects
Domain value redundancy
Non-standard data formats
Non-atomic data values
Multipurpose data fields
Embedded meanings
Inconsistent data values
Data quality contamination
37
Data content defects Examples
Domain value redundancy
Unit of Measure
Dozen, Doz., Dz., 12
Non-standard data formats
Phone Numbers
1234567 or 123.456.7
Non-atomic data fields
Name & Addresses
Dr. Nguyen Van A, PhD
38
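A sketch of standardizing two of these defects, the unit-of-measure synonyms and the phone formats; the synonym table and the digits-only phone form are assumptions.

```python
import re

# Domain value redundancy: many spellings of the same unit of measure.
UNIT_SYNONYMS = {"dozen": "12", "doz.": "12", "doz": "12", "dz.": "12", "dz": "12"}

def standardize_unit(value):
    v = value.strip().lower()
    return UNIT_SYNONYMS.get(v, v)

# Non-standard data formats: keep digits only as one canonical phone form.
def standardize_phone(value):
    return re.sub(r"\D", "", value)

print(standardize_unit("Doz."))        # 12
print(standardize_phone("123.456.7"))  # 1234567
```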
Data content defects Examples
Embedded Meanings
RC, AP, RJ → received, approved, rejected
39
Background
• Other names: also called data scrubbing or data cleaning.
• More than data arranging: a DWH is NOT just about arranging data; the data should be clean for the overall health of the organization. We drink clean water!
• Big problem, big effect: an enormous problem, as most data is dirty. GIGO.
• Dirty is relative: dirty means the data does not conform to the proper domain definition, and this varies from domain to domain.
• Paradox: must involve a domain expert, as detailed domain knowledge is required, so it becomes semi-automatic; yet it has to be automatic because of the large data sets.
• Data duplication: the original problem was removing duplicates within one system, compounded by duplicates coming from many systems.
40
Lighter Side of Dirty Data
• Year of birth 1995, current year 2005.
• Born in 1986, hired in 1985.
• Who would take it seriously? Computers, while summarizing, aggregating, populating, etc.
• Small discrepancies become irrelevant for large
averages, but what about sums, medians, maximum,
minimum etc.?
41
Serious Side of dirty data
• Decision making at the government level on investment, based on the rate of births, in terms of schools and then teachers. Wrong data results in over- and under-investment.
• Direct mail marketing: letters sent to wrong addresses are returned, or multiple letters go to the same address; this means loss of money, a bad reputation, and wrong identification of marketing regions.
42
3 Classes of Anomalies…
▣ Syntactically Dirty Data
◾Lexical Errors
◾Irregularities
▣ Semantically Dirty Data
◾Integrity Constraint Violation
◾Business rule contradiction
◾Duplication
▣ Coverage Anomalies
◾Missing Attributes
◾Missing Records
43
3 Classes of Anomalies…
▣ Syntactically Dirty Data
◾ Lexical Errors
◾ Discrepancies between the structure of the data items and the specified format of the stored values
◾ e.g. the number of columns used is unexpected for a tuple (mixed-up number of attributes)
◾ Irregularities
◾ Non-uniform use of units and values, such as giving only an annual salary without saying whether it is in US$ or PK Rs
▣ Semantically Dirty Data
◾ Integrity constraint violation
◾ Contradiction
e.g. DoB > hiring date
◾ Duplication
44
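A sketch of checking contradictions of this kind, with illustrative field names; the rules mirror the DoB-versus-hiring-date example.

```python
from datetime import date

def contradiction_checks(record):
    """Return the business-rule contradictions found in one employee record."""
    problems = []
    if record["date_of_birth"] >= record["hiring_date"]:
        problems.append("date_of_birth is on or after hiring_date")
    if record["hiring_date"] > date.today():
        problems.append("hiring_date is in the future")
    return problems

bad = {"date_of_birth": date(1986, 5, 1), "hiring_date": date(1985, 3, 10)}
print(contradiction_checks(bad))  # ['date_of_birth is on or after hiring_date']
```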
3 Classes of Anomalies…
• Coverage Anomalies
• Missing Attributes
The result of omissions while collecting the data.
A constraint violation if we have null values for attributes where a NOT NULL constraint exists.
The case is more complicated where no such constraint exists: we have to decide whether the value exists in the real world and has to be deduced here, or not.
45
Coverage Anomalies
• Equipment malfunction (bar code reader, keyboard
etc.)
• Inconsistent with other recorded data and thus
deleted.
• Data not entered due to misunderstanding/illegibility.
• Data not considered important at the time of entry
(e.g. Y2K).
46
Handling missing data
• Dropping records.
• “Manually” filling missing values.
• Using a global constant as filler.
• Using the attribute mean (or median) as filler.
• Using the most probable value as filler.
47
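A sketch of these fillers on a toy attribute; the “most probable value” is approximated here by the mode of the known values, a simplification of what a real imputation model would do, and manual filling is not shown.

```python
from statistics import mean, median, mode

ages = [25, 31, None, 40, 31, None, 28]
known = [a for a in ages if a is not None]

dropped       = known                                                        # drop records
constant_fill = [a if a is not None else -1 for a in ages]                   # global constant
mean_fill     = [a if a is not None else round(mean(known)) for a in ages]   # attribute mean
median_fill   = [a if a is not None else median(known) for a in ages]        # (or median)
mode_fill     = [a if a is not None else mode(known) for a in ages]          # most probable value

print(mean_fill)   # [25, 31, 31, 40, 31, 31, 28]
print(mode_fill)   # [25, 31, 31, 40, 31, 31, 28]
```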
Key Based Classification of Problems
• Primary key problems
• Non-Primary key problems
48
Primary key problems
◾ Same PK but different data.
◾ Same entity with different keys.
◾ PK in one system but not in other.
◾ Same PK but in different formats.
49
Non primary key problems…
• Different encoding in different sources.
• Multiple ways to represent the same
information.
• Sources might contain invalid data.
• Two fields with different data but same name.
50
Non primary key problems
• Required fields left blank.
• Data erroneous or incomplete.
• Data contains null values.
51
Automatic Data Cleansing…
1. Statistical
2. Pattern Based
3. Clustering
4. Association Rules
52