
Bases de Dados e Armazéns de Dados

Departamento de Engenharia Informática (DEI/ISEP)

Paulo Oliveira
pjo@isep.ipp.pt

Bibliography
 The Data Warehouse ETL Toolkit: Practical Techniques for Extracting, Cleaning, Conforming and Delivering Data
Ralph Kimball, Joe Caserta
Wiley, 2004
Chapters 3, 4, 5, and 6

Extraction, Transformation, and Loading (ETL)

Extraction

Data Extraction
 First step in the process of getting data into the DW environment

 Means reading and understanding the source data and copying the data needed in the DW into the staging area for further manipulation

 Often performed by custom routines – not recommended because of:
- High program maintenance
- No automatically generated metadata

 Increasingly performed by specialized ETL software
- Provides simpler, faster and cheaper development

Data Extraction
 The ETL process needs to integrate systems having different:
- Database management systems
- Operating systems
- Hardware

 It is necessary to build a logical data map that documents the relationship between original source attributes and final destination attributes
- Identify data sources
- Analyse source systems with a data profiling tool (data quality)
- Data lineage and business rules
- Validate calculations and formulas

Components of the Logical Data Map
 The logical data map is presented in a table that includes:
- Target table, target attribute and table type (dimension or fact)
- Slowly Changing Dimension (SCD) type per target attribute:
 Type 1 – overwrite (e.g., customer first name)
 Type 2 – retain history (e.g., customer city)
- Source database, source table(s) and source attribute(s)
- Transformation: the manipulation performed, annotated in SQL or pseudo-code

Logical Data Map
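An illustrative sketch of a couple of logical data map rows, expressed as Python records; all table, attribute and transformation names below are hypothetical:

# Illustrative only: two hypothetical rows of a logical data map,
# one dimension attribute and one fact attribute.
logical_data_map = [
    {
        "target_table": "dim_customer",
        "target_attribute": "city",
        "table_type": "dimension",
        "scd_type": 2,                            # type 2: retain history when the city changes
        "source_database": "CRM",
        "source_table": "customers",
        "source_attribute": "city",
        "transformation": "UPPER(TRIM(city))",    # transformation annotated in SQL
    },
    {
        "target_table": "fact_sales",
        "target_attribute": "net_amount",
        "table_type": "fact",
        "scd_type": None,                         # SCD types apply to dimension attributes only
        "source_database": "ERP",
        "source_table": "order_lines",
        "source_attribute": "gross_amount, discount",
        "transformation": "gross_amount - discount",
    },
]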

Analysis of the Source System
 ER model of the system
 Reverse engineering by looking at the metadata of the source system to understand it
- Unique identifiers and natural keys
- Data types
- Relationships between tables: 1-to-1, 1-to-many, many-to-many
 Problematic when the source database does not have foreign keys defined
 Discrete relationships (reference tables)
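A minimal sketch of this metadata-driven reverse engineering, assuming a source DBMS that exposes the standard INFORMATION_SCHEMA views (PostgreSQL is used here purely as an example; connection details and schema name are illustrative):

# List foreign-key relationships of a source schema from its metadata.
import psycopg2  # assumption: a PostgreSQL source reachable from the staging area

FK_QUERY = """
SELECT tc.table_name, kcu.column_name,
       ccu.table_name AS referenced_table, ccu.column_name AS referenced_column
FROM information_schema.table_constraints tc
JOIN information_schema.key_column_usage kcu
  ON tc.constraint_name = kcu.constraint_name
JOIN information_schema.constraint_column_usage ccu
  ON tc.constraint_name = ccu.constraint_name
WHERE tc.constraint_type = 'FOREIGN KEY'
  AND tc.table_schema = %s
"""

def list_foreign_keys(dsn: str, schema: str):
    """Return (table, column, referenced_table, referenced_column) tuples."""
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(FK_QUERY, (schema,))
        return cur.fetchall()

# When the source has no foreign keys declared, this returns an empty list and
# the relationships must be inferred with a data profiling tool instead.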

Integrating Data From Different Sources
 It is very important to determine the system-of-record – the originating source of the data
- In most enterprises, data is stored across many different systems

 When a dimension is populated by several distinct systems, it is important to store:
- The source system from which the data comes
- The unique identifier (primary key) from that source system

 Identifiers should be viewable by end-users to ensure that the dimension reflects their data, and they can tie back to their operational system
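A minimal sketch of this idea, with hypothetical names: each dimension row carries the DW surrogate key plus the source system and the natural key it had in that system, so end-users can tie a record back to the operational system.

from dataclasses import dataclass

@dataclass
class CustomerDimRow:
    customer_sk: int          # surrogate key generated in the DW
    source_system: str        # e.g. "CRM" or "BILLING" (illustrative)
    source_customer_id: str   # natural/primary key in that source system
    first_name: str
    city: str

surrogate_counter = 0

def new_dim_row(source_system: str, source_id: str, first_name: str, city: str) -> CustomerDimRow:
    """Assign the next surrogate key and preserve the source identifiers."""
    global surrogate_counter
    surrogate_counter += 1
    return CustomerDimRow(surrogate_counter, source_system, source_id, first_name, city)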

Two Generic Types of Data Extracts
 Static extract captures a snapshot of all the source data at a point in time
- Used to fill the DW initially

 Incremental extract captures only the changes that have occurred in the source data since the last capture
- Used for ongoing DW updates


Incremental Extract
 Retrieve only the records from the source that were inserted or modified since the last extraction

 Audit columns are usually populated by the front-end application or via database triggers fired automatically as records are inserted or updated (see the sketch below)
- Create date/time-stamp
- Last update date/time-stamp

 Database logs
- Only images logged after the last data extraction are selected from the log to identify new and changed records
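A minimal sketch of an audit-column based incremental extract, assuming a source table with a last_update_ts audit column; table, column and connection details are illustrative:

from datetime import datetime
import psycopg2  # assumption: source system accessible as a PostgreSQL database

def incremental_extract(src_dsn: str, last_extraction: datetime):
    """Fetch only rows inserted or modified since the previous extraction run."""
    query = """
        SELECT customer_id, first_name, city, last_update_ts
        FROM customers
        WHERE last_update_ts > %s
    """
    with psycopg2.connect(src_dsn) as conn, conn.cursor() as cur:
        cur.execute(query, (last_extraction,))
        for row in cur:
            yield row  # each row goes to the staging area for further manipulation

# The timestamp of the current run is saved afterwards and becomes the
# "last extraction" boundary of the next run.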

Static Extract – Worst Case Scenario
 The source system does not notify changes and does not have a date/time-stamp on its own inserts/updates

 For small tables, use a brute-force approach, comparing every incoming attribute with every attribute in the DW to see if anything changed (sketched below)
- Bring today’s entire data into the staging area
- Perform a comparison with the data in the DW
- Insert or update the new or changed data in the DW
- Inefficient but the most reliable

 For larger tables, use the Cyclic-Redundancy Checksum (CRC) approach
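A minimal sketch of the brute-force comparison, assuming the day's full extract and the current DW rows both fit in memory, share the same layout and are keyed by the same natural key; names are illustrative:

def brute_force_diff(staging_rows: dict, dw_rows: dict):
    """staging_rows and dw_rows map natural key -> dict of attribute values."""
    inserts, updates = [], []
    for key, incoming in staging_rows.items():
        existing = dw_rows.get(key)
        if existing is None:
            inserts.append(incoming)                              # new record: insert into the DW
        elif any(incoming[col] != existing[col] for col in incoming):
            updates.append(incoming)                              # at least one attribute changed: update
    return inserts, updates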

Static Extract – CRC Approach
 Procedure
- Treat the entire incoming record as a string
- Compute the CRC value of the string
 Numeric value of about 20 digits
- Compare the CRC value of the new record with the CRC value of the existing record
- If the CRC values match
 The new record is equal to the existing record
- If the CRC values do not match
 Do a field-by-field comparison to see what has changed
 Depending on whether the changed field is a type-1 or type-2 change, do the necessary updates

 Some ETL packages include CRC computation
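A minimal sketch of this procedure, assuming each record is serialized to one string and the CRC of the previous load is kept per natural key. zlib.crc32 gives a 32-bit value (a 64-bit CRC would be closer to the roughly 20 digits mentioned above), but the comparison logic is the same:

import zlib

def record_crc(record: dict) -> int:
    """Treat the whole incoming record as one string and compute its CRC."""
    as_string = "|".join(str(record[col]) for col in sorted(record))
    return zlib.crc32(as_string.encode("utf-8"))

def detect_change(incoming: dict, previous_crc: int, existing: dict):
    """Return None if unchanged, otherwise the list of fields that differ."""
    if record_crc(incoming) == previous_crc:
        return None                        # CRCs match: the record is unchanged
    # CRCs differ: fall back to a field-by-field comparison; whether each
    # changed field triggers a type-1 overwrite or a type-2 new row is decided
    # by the SCD type recorded in the logical data map.
    return [col for col in incoming if incoming[col] != existing.get(col)]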

Transformation


Data Cleaning and Conforming
 Data cleaning means identifying and correcting errors in the data
- Misspellings
- Domain violations
- Missing values
- Duplicate records
- Business rule violations

 Data conforming means resolving the conflicts between incompatible data sources so that they can be used together
- Requires an enterprise-wide agreement to use standardized domains and measures
What Causes Poor Data Quality?
 There are no standards for data capture
 Standards may exist but are not enforced at the point of data capture
 Inconsistent data entry occurs (use of nicknames or aliases)
 Data entry mistakes happen (character transposition, misspellings, and so on)
 Integration of data from different systems with different data quality standards

Data quality problems are perceived as time-consuming and expensive to fix


Data Cleaning
 Source systems contain “dirty data” that must be cleaned

 ETL software contains rudimentary data cleaning capabilities

 Specialized data cleaning software is often used

 Steps in data cleaning
- Parsing
- Correcting
- Standardizing
- Matching
- Consolidating
Data Cleaning Steps
 Parsing
- Locates and identifies individual data elements in the source attributes and then isolates these data elements in the targets (see the sketch below)
- Examples
 Parsing into first, middle, and last name
 Parsing into street number and street name
 Parsing into zip code and city

 Correcting
- Corrects parsed individual values using data algorithms and secondary data sources
- Example
 Correct an address by adding a zip code
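A minimal, illustrative sketch of the parsing step: splitting a free-form contact name into first/middle/last name and separating a street line into number and street name. Real cleaning tools handle far more variants than this.

import re

def parse_name(full_name: str) -> dict:
    parts = full_name.split()
    return {
        "first_name": parts[0],
        "middle_name": " ".join(parts[1:-1]),   # empty when there is no middle name
        "last_name": parts[-1],
    }

def parse_street(street_line: str) -> dict:
    match = re.match(r"\s*(\d+)\s+(.*)", street_line)
    if match:
        return {"street_number": match.group(1), "street_name": match.group(2)}
    return {"street_number": None, "street_name": street_line.strip()}

# parse_name("Mark W. Craver") -> {'first_name': 'Mark', 'middle_name': 'W.', 'last_name': 'Craver'}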


Data Cleaning Steps
 Standardizing
- Applies conversion rules to transform data into its preferred and consistent format
- Example: replacing an acronym, replacing an abbreviation

 Matching
- Searches for and matches records within and across the parsed, corrected and standardized database using predefined detection rules to identify duplicates
- Example: identifying similar names and addresses

 Consolidating
- Analyzes and identifies relationships between matched records and merges them into one representation (see the sketch below)
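A minimal sketch of the matching and consolidating steps, assuming the records have already been parsed and standardized. The detection rule here (same standardized last name and zip code) is a deliberately simple, hypothetical example; real tools use fuzzy matching on several fields.

from collections import defaultdict

def match_records(records: list[dict]) -> list[list[dict]]:
    """Group records that are likely duplicates of the same real-world entity."""
    groups = defaultdict(list)
    for rec in records:
        key = (rec["last_name"].upper(), rec.get("zip_code"))
        groups[key].append(rec)
    return list(groups.values())

def consolidate(group: list[dict]) -> dict:
    """Merge matched records into one survivor, preferring non-empty values."""
    survivor: dict = {}
    for rec in group:
        for col, value in rec.items():
            if value and not survivor.get(col):
                survivor[col] = value
    return survivor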

Data Cleaning Example
(Figure: several inconsistent operational-system records for the same person, such as "Mark Carver, SAS, SAS Campus Drive, Cary, N.C." and "Mark Craver, Systems Engineer, SAS", are cleaned and consolidated into a single data warehouse record: Mark W. Craver, Systems Engineer, SAS, SAS Campus Drive, Cary, N.C. 27513, Mark.Craver@sas.com)