Databases and Data Warehouses
Departamento de Engenharia Informática (DEI/ISEP)
Paulo Oliveira
pjo@isep.ipp.pt
Bibliography
The Data Warehouse ETL Toolkit: Practical Techniques for Extracting, Cleaning, Conforming, and Delivering Data
Ralph Kimball, Joe Caserta
Wiley, 2004
Chapters 3, 4, 5, and 6
Extraction, Transformation,
and Loading (ETL)
Extraction
Data Extraction
First step in the process of getting data into the DW environment
Means reading and understanding the source data and copying the data needed by the DW into the staging area for further manipulation
Often performed by custom routines – not recommended because of:
- High program maintenance
- No automatically generated metadata
Increasingly performed by specialized ETL software
- Provides simpler, faster, and cheaper development
Data Extraction
ETL process needs to integrate systems having different:
- Database management systems
- Operating systems
- Hardware
It is necessary to build a logical data map that documents the relationship between original source attributes and final destination attributes:
- Identify data sources
- Analyze source systems with a data profiling tool (data quality)
- Data lineage and business rules
- Validate calculations and formulas
Components of the Logical Data Map
Logical data map is presented in a table that includes:
- Target table, target attribute, and table type (dimension or fact)
- Slowly Changing Dimension (SCD) type per target attribute:
  Type 1 – overwrite (e.g., customer first name)
  Type 2 – retain history (e.g., customer city)
- Source database, source table(s), and source attribute(s)
- Transformation
  The manipulation performed, annotated in SQL or pseudo-code
Logical Data Map
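A hypothetical single row of such a map, with all table, attribute, and rule names being illustrative rather than taken from the source:

Target Table | Target Attribute | Table Type | SCD Type | Source Database | Source Table | Source Attribute | Transformation
DIM_CUSTOMER | CUSTOMER_CITY    | Dimension  | 2        | CRM_PROD        | CUSTOMER     | CITY             | UPPER(TRIM(CITY))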
Analysis of the Source System
ER model of the system
Reverse engineering by examining the metadata of the source system to understand it (a metadata inspection is sketched after this list):
- Unique identifiers and natural keys
- Data types
- Relationships between tables: 1-to-1, 1-to-many, many-to-many
  Problematic when the source database does not have foreign keys defined
- Discrete relationships (reference tables)
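As one concrete way to do this, the minimal sketch below reads table, column, primary-key, and foreign-key metadata from a SQLite source; other DBMSs expose the same information through their information_schema views. The database file name is a placeholder.

```python
import sqlite3

# Hypothetical source database file; any SQLite database works here.
conn = sqlite3.connect("source_system.db")

# Enumerate the user tables of the source system.
tables = [name for (name,) in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'")]

for table in tables:
    print(f"Table: {table}")
    # Column names, data types, and primary-key flags.
    for cid, name, dtype, notnull, default, pk in conn.execute(
            f"PRAGMA table_info({table})"):
        print(f"  {name} {dtype}{' PRIMARY KEY' if pk else ''}")
    # Declared foreign keys reveal 1-to-many relationships; when none
    # are declared, relationships must be inferred by hand.
    for fk in conn.execute(f"PRAGMA foreign_key_list({table})"):
        print(f"  FK: {fk[3]} -> {fk[2]}({fk[4]})")
```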
Integrating Data From Different Sources
It is very important to determine the system-of-record – the originating source of data
- In most enterprises, data is stored across many different systems
When a dimension is populated by several distinct systems, it is important to store (see the sketch below):
- The source system from which the data comes
- The unique identifier (primary key) from that source system
Identifiers should be viewable by end users, so that they can verify that the dimension reflects their data and tie it back to their operational system
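A minimal sketch of what this means for a dimension row, assuming a customer dimension fed by two hypothetical systems (all names and values are illustrative):

```python
# Each dimension row carries the DW surrogate key plus the source
# system and the natural key it had there, so end users can tie the
# row back to their operational system.
dim_customer = [
    {"customer_key": 1,                # DW surrogate key
     "source_system": "CRM",           # system-of-record for this row
     "source_natural_key": "C-00417",  # primary key in the CRM
     "name": "Mark W. Craver"},
    {"customer_key": 2,
     "source_system": "BILLING",
     "source_natural_key": "98231",
     "name": "Ana Silva"},
]
```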
Two Generic Types of Data Extracts
Static extract is a method of capturing a snapshot of all the source data at a point in time
- Used to fill the DW initially
Incremental extract captures only the changes that have occurred in the source data since the last capture
- Used for ongoing DW updates
Incremental Extract
Retrieve only the records from the source that were inserted or modified since the last extraction
Audit columns are usually populated by the front-end application or via database triggers fired automatically as records are inserted or updated (a minimal sketch follows this slide):
- Create date/time-stamp
- Last update date/time-stamp
Database logs
- Only images that are logged after the last data extraction are selected from the log to identify new and changed records
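A minimal sketch of an audit-column-driven incremental extract, assuming a customers source table with create_ts and update_ts columns (all names are illustrative; the watermark would normally be read from ETL control metadata, not hard-coded):

```python
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect("source_system.db")  # hypothetical source

# Timestamp of the previous successful extraction (the "watermark").
last_extraction = "2024-01-15 02:00:00"

# Select only records inserted or updated since the last run.
changed_rows = conn.execute(
    "SELECT * FROM customers "
    "WHERE create_ts > ? OR update_ts > ?",
    (last_extraction, last_extraction),
).fetchall()

# Remember the new watermark for the next incremental run.
new_watermark = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S")
```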
Static Extract – Worst Case Scenario
Source system does not notify changes and does not have a date/time-stamp on its own inserts/updates
For small data tables, use a brute-force approach, comparing every incoming attribute with every attribute in the DW to see if anything changed:
- Bring today's entire data into the staging area
- Perform a comparison with the data in the DW
- Insert or update the new or changed data in the DW
- Inefficient, but the most reliable (a minimal sketch follows this slide)
For larger tables, use the Cyclic Redundancy Checksum (CRC) approach
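A minimal sketch of the brute-force comparison, assuming today's full extract and the DW copy both fit in memory as dictionaries keyed by the natural key (all names and values are illustrative):

```python
def diff_full_extract(staged, warehouse):
    """staged, warehouse: dicts mapping natural key -> record dict."""
    inserts, updates = [], []
    for key, new_rec in staged.items():
        old_rec = warehouse.get(key)
        if old_rec is None:
            inserts.append(new_rec)   # record not yet in the DW
        elif old_rec != new_rec:      # some attribute changed
            updates.append(new_rec)
    return inserts, updates

staged = {"C-1": {"name": "Mark", "city": "Cary"}}
warehouse = {"C-1": {"name": "Mark", "city": "Porto"}}
print(diff_full_extract(staged, warehouse))  # C-1 lands in updates
```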
Static Extract – CRC Approach
Procedure
- Treat the entire incoming record as a string
- Compute the CRC value of the string
  A numeric value of about 20 digits
- Compare the CRC value of the new record with the CRC value of the existing record
- If the CRC values match
  The new record is equal to the existing record
- If the CRC values do not match
  Do a field-by-field comparison to see what has changed
  Depending on whether the changed field is a type-1 or type-2 change, do the necessary updates
Some ETL packages include CRC computation; a minimal sketch follows
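The sketch below uses Python's zlib.crc32, which yields a 32-bit value (roughly 10 digits) rather than the ~20-digit checksums some ETL tools use; field names and values are illustrative:

```python
import zlib

def record_crc(record):
    # Treat the entire incoming record as one string and checksum it.
    as_string = "|".join(str(v) for v in record.values())
    return zlib.crc32(as_string.encode("utf-8"))

old = {"id": 42, "first_name": "Mark", "city": "Cary"}
new = {"id": 42, "first_name": "Mark", "city": "Porto"}

if record_crc(new) != record_crc(old):
    # CRCs differ: fall back to a field-by-field comparison to find
    # what changed and decide between a type-1 and a type-2 update.
    changed = [k for k in new if new[k] != old[k]]
    print(changed)  # ['city'] -> customer city is a type-2 change
```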
Transformation
Data Cleaning and Conforming
Data cleaning means identifying and correcting errors in data
- Misspellings
- Domain violations
- Missing values
- Duplicate records
- Business rules violations
Data conforming means resolving the conflicts between incompatible data sources so that they can be used together
- Requires an enterprise-wide agreement to use standardized domains and measures (a minimal sketch follows)
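A minimal sketch of conforming one domain, assuming two hypothetical sources that encode gender differently and an agreed standardized target domain (all values are illustrative):

```python
# Enterprise-wide mapping from source encodings to the conformed domain.
CONFORMED_GENDER = {
    "M": "Male", "MALE": "Male", "1": "Male",
    "F": "Female", "FEMALE": "Female", "2": "Female",
}

def conform_gender(raw_value):
    # Unknown encodings are flagged rather than guessed.
    return CONFORMED_GENDER.get(str(raw_value).strip().upper(), "Unknown")

print(conform_gender("f"))  # Female
print(conform_gender(1))    # Male
print(conform_gender("X"))  # Unknown
```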
What Causes Poor Data Quality?
There are no standards for data capture
Standards may exist but are not enforced at the point of
data capture
Inconsistent data entry occurs (use of nicknames or aliases)
Data entry mistakes happen (character transposition,
misspellings, and so on)
Integration of data from different systems with different
data quality standards
Data quality problems are perceived as
time-consuming and expensive to fix
Data Cleaning
Source systems contain “dirty data” that must be
cleaned
ETL software contains rudimentary data cleaning
capabilities
Specialized data cleaning software is often used
Steps in data cleaning
- Parsing
- Correcting
- Standardizing
- Matching
- Consolidating
Data Cleaning Steps
Parsing
- Locates and identifies individual data elements in the source attributes and then isolates these data elements in the targets
- Examples (a minimal sketch follows this slide)
  Parsing into first, middle, and last name
  Parsing into street number and street name
  Parsing into zip code and city
Correcting
- Corrects parsed individual values using data algorithms and secondary data sources
- Example
  Correct an address by adding a zip code
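A minimal sketch of the parsing step, splitting an address line into street number and street name (the regular expression is illustrative; real cleaning tools use much richer address grammars):

```python
import re

def parse_street(address_line):
    # Optional leading digits are the street number; the rest is the name.
    match = re.match(r"^\s*(\d+)?\s*(.+?)\s*$", address_line)
    return match.group(1), match.group(2)

print(parse_street("123 Campus Drive"))  # ('123', 'Campus Drive')
print(parse_street("SAS Campus Drive"))  # (None, 'SAS Campus Drive')
```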
Data Cleaning Steps
Standardizing
- Applies conversion rules to transform data into its preferred and consistent format
- Example: replacing an acronym, replacing an abbreviation
Matching
- Searches for and matches records within and across the parsed, corrected, and standardized database, based on predefined detection rules, to identify duplicates
- Example: identifying similar names and addresses (see the sketch below)
Consolidating
- Analyzes and identifies relationships between matched records and merges them into one representation
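A minimal sketch of the matching step, using a simple string-similarity score as the predefined detection rule (the threshold and records are illustrative; dedicated tools use far more sophisticated matching):

```python
from difflib import SequenceMatcher

def similarity(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

existing = ["Mark Carver", "Mark W. Craver"]
incoming = "Mark Craver"

for name in existing:
    score = similarity(incoming, name)
    if score > 0.8:  # predefined detection rule
        # Matched records would then be consolidated into one representation.
        print(f"possible duplicate: {name!r} (score {score:.2f})")
```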
Data Cleaning Example
Operational systems – three inconsistent records for the same person:
- Mark Carver, SAS, SAS Campus Drive, Cary, N.C.
- Mark W. Craver, Systems Engineer, SAS, SAS Campus Drive, Cary, N.C. 27513, Mark.Craver@sas.com
- Mark Craver, Systems Engineer, SAS
Data warehouse – one cleaned, consolidated record:
- Mark W. Craver, Systems Engineer, SAS, SAS Campus Drive, Cary, N.C. 27513, Mark.Craver@sas.com