Databases and Data Warehouses
Departamento de Engenharia Informática (DEI/ISEP)
Paulo Oliveira
pjo@isep.ipp.pt
Bibliography
The Data Warehouse ETL Toolkit: Practical Techniques for Extracting, Cleaning, Conforming, and Delivering Data
Ralph Kimball, Joe Caserta
Wiley, 2004
Chapters 3, 4, 5, and 6
Extraction, Transformation,
and Loading (ETL)
Extraction
Data Extraction
First step in the process of getting data into the DW environment
Means reading and understanding the source data and copying the data needed by the DW into the staging area for further manipulation
Often performed by custom routines – not recommended because of:
- High program maintenance
- No automatically generated metadata
Increasingly performed by specialized ETL software
- Provides simpler, faster, and cheaper development
Data Extraction
ETL process needs to integrate systems having different:
- Database management systems
- Operating systems
- Hardware
It is necessary to build a logical data map that documents the relationship between original source attributes and final destination attributes:
- Identify data sources
- Analyze source systems with a data profiling tool (data quality)
- Data lineage and business rules
- Validate calculations and formulas
Components of the Logical Data Map
Logical data map is presented in a table that includes:
- Target table, target attribute, and table type (dimension or fact)
- Slowly Changing Dimension (SCD) type per target attribute:
  Type 1 – overwrite (e.g., customer first name)
  Type 2 – retain history (e.g., customer city)
- Source database, source table(s), and source attribute(s)
- Transformation
  The manipulation performed, annotated in SQL or pseudo-code
Logical Data Map
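A hypothetical single row of such a map, with all table, attribute, and rule names being illustrative rather than taken from the source:

Target Table | Target Attribute | Table Type | SCD Type | Source Database | Source Table | Source Attribute | Transformation
DIM_CUSTOMER | CUSTOMER_CITY    | Dimension  | 2        | CRM_PROD        | CUSTOMER     | CITY             | UPPER(TRIM(CITY))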
Analysis of the Source System
ER model of the system
Reverse engineering by examining the metadata of the source system to understand it (a metadata inspection is sketched after this list):
- Unique identifiers and natural keys
- Data types
- Relationships between tables: 1-to-1, 1-to-many, many-to-many
  Problematic when the source database does not have foreign keys defined
- Discrete relationships (reference tables)
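As one concrete way to do this, the minimal sketch below reads table, column, primary-key, and foreign-key metadata from a SQLite source; other DBMSs expose the same information through their information_schema views. The database file name is a placeholder.

```python
import sqlite3

# Hypothetical source database file; any SQLite database works here.
conn = sqlite3.connect("source_system.db")

# Enumerate the user tables of the source system.
tables = [name for (name,) in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'")]

for table in tables:
    print(f"Table: {table}")
    # Column names, data types, and primary-key flags.
    for cid, name, dtype, notnull, default, pk in conn.execute(
            f"PRAGMA table_info({table})"):
        print(f"  {name} {dtype}{' PRIMARY KEY' if pk else ''}")
    # Declared foreign keys reveal 1-to-many relationships; when none
    # are declared, relationships must be inferred by hand.
    for fk in conn.execute(f"PRAGMA foreign_key_list({table})"):
        print(f"  FK: {fk[3]} -> {fk[2]}({fk[4]})")
```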
Integrating Data From Different Sources
It is very important to determine the system-of-record – the originating source of data
- In most enterprises, data is stored across many different systems
When a dimension is populated by several distinct systems, it is important to store (see the sketch below):
- The source system from which the data comes
- The unique identifier (primary key) from that source system
Identifiers should be viewable by end users, so that they can verify that the dimension reflects their data and tie it back to their operational system
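A minimal sketch of what this means for a dimension row, assuming a customer dimension fed by two hypothetical systems (all names and values are illustrative):

```python
# Each dimension row carries the DW surrogate key plus the source
# system and the natural key it had there, so end users can tie the
# row back to their operational system.
dim_customer = [
    {"customer_key": 1,                # DW surrogate key
     "source_system": "CRM",           # system-of-record for this row
     "source_natural_key": "C-00417",  # primary key in the CRM
     "name": "Mark W. Craver"},
    {"customer_key": 2,
     "source_system": "BILLING",
     "source_natural_key": "98231",
     "name": "Ana Silva"},
]
```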
Two Generic Types of Data Extracts
Static extract is a method of capturing a snapshot of all the source data at a point in time
- Used to fill the DW initially
Incremental extract captures only the changes that have occurred in the source data since the last capture
- Used for ongoing DW updates
Incremental Extract
Retrieve only the records from the source that were inserted or modified since the last extraction
Audit columns are usually populated by the front-end application or via database triggers fired automatically as records are inserted or updated (a minimal sketch follows this slide):
- Create date/time-stamp
- Last update date/time-stamp
Database logs
- Only images that are logged after the last data extraction are selected from the log to identify new and changed records
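A minimal sketch of an audit-column-driven incremental extract, assuming a customers source table with create_ts and update_ts columns (all names are illustrative; the watermark would normally be read from ETL control metadata, not hard-coded):

```python
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect("source_system.db")  # hypothetical source

# Timestamp of the previous successful extraction (the "watermark").
last_extraction = "2024-01-15 02:00:00"

# Select only records inserted or updated since the last run.
changed_rows = conn.execute(
    "SELECT * FROM customers "
    "WHERE create_ts > ? OR update_ts > ?",
    (last_extraction, last_extraction),
).fetchall()

# Remember the new watermark for the next incremental run.
new_watermark = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S")
```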
Static Extract – Worst Case Scenario
Source system does not notify changes and does not have a date/time-stamp on its own inserts/updates
For small data tables, use a brute-force approach, comparing every incoming attribute with every attribute in the DW to see if anything changed:
- Bring today's entire data into the staging area
- Perform a comparison with the data in the DW
- Insert or update the new or changed data in the DW
- Inefficient, but the most reliable (a minimal sketch follows this slide)
For larger tables, use the Cyclic Redundancy Checksum (CRC) approach
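A minimal sketch of the brute-force comparison, assuming today's full extract and the DW copy both fit in memory as dictionaries keyed by the natural key (all names and values are illustrative):

```python
def diff_full_extract(staged, warehouse):
    """staged, warehouse: dicts mapping natural key -> record dict."""
    inserts, updates = [], []
    for key, new_rec in staged.items():
        old_rec = warehouse.get(key)
        if old_rec is None:
            inserts.append(new_rec)   # record not yet in the DW
        elif old_rec != new_rec:      # some attribute changed
            updates.append(new_rec)
    return inserts, updates

staged = {"C-1": {"name": "Mark", "city": "Cary"}}
warehouse = {"C-1": {"name": "Mark", "city": "Porto"}}
print(diff_full_extract(staged, warehouse))  # C-1 lands in updates
```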
Static Extract – CRC Approach
Procedure
- Treat the entire incoming record as a string
- Compute the CRC value of the string
  A numeric value of about 20 digits
- Compare the CRC value of the new record with the CRC value of the existing record
- If the CRC values match
  The new record is equal to the existing record
- If the CRC values do not match
  Do a field-by-field comparison to see what has changed
  Depending on whether the changed field is a type-1 or type-2 change, do the necessary updates
Some ETL packages include CRC computation; a minimal sketch follows
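The sketch below uses Python's zlib.crc32, which yields a 32-bit value (roughly 10 digits) rather than the ~20-digit checksums some ETL tools use; field names and values are illustrative:

```python
import zlib

def record_crc(record):
    # Treat the entire incoming record as one string and checksum it.
    as_string = "|".join(str(v) for v in record.values())
    return zlib.crc32(as_string.encode("utf-8"))

old = {"id": 42, "first_name": "Mark", "city": "Cary"}
new = {"id": 42, "first_name": "Mark", "city": "Porto"}

if record_crc(new) != record_crc(old):
    # CRCs differ: fall back to a field-by-field comparison to find
    # what changed and decide between a type-1 and a type-2 update.
    changed = [k for k in new if new[k] != old[k]]
    print(changed)  # ['city'] -> customer city is a type-2 change
```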
Transformation
Data Cleaning and Conforming
Data cleaning means identifying and correcting errors in data
- Misspellings
- Domain violations
- Missing values
- Duplicate records
- Business rules violations
Data conforming means resolving the conflicts between incompatible data sources so that they can be used together
- Requires an enterprise-wide agreement to use standardized domains and measures (a minimal sketch follows)
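A minimal sketch of conforming one domain, assuming two hypothetical sources that encode gender differently and an agreed standardized target domain (all values are illustrative):

```python
# Enterprise-wide mapping from source encodings to the conformed domain.
CONFORMED_GENDER = {
    "M": "Male", "MALE": "Male", "1": "Male",
    "F": "Female", "FEMALE": "Female", "2": "Female",
}

def conform_gender(raw_value):
    # Unknown encodings are flagged rather than guessed.
    return CONFORMED_GENDER.get(str(raw_value).strip().upper(), "Unknown")

print(conform_gender("f"))  # Female
print(conform_gender(1))    # Male
print(conform_gender("X"))  # Unknown
```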
What Causes Poor Data Quality?
There are no standards for data capture
Standards may exist but are not enforced at the point of
data capture
Inconsistent data entry occurs (use of nicknames or aliases)
Data entry mistakes happen (character transposition,
misspellings, and so on)
Integration of data from different systems with different
data quality standards
Data quality problems are perceived as
time-consuming and expensive to fix
Data Cleaning
Source systems contain “dirty data” that must be
cleaned
ETL software contains rudimentary data cleaning
capabilities
Specialized data cleaning software is often used
Steps in data cleaning
- Parsing
- Correcting
- Standardizing
- Matching
- Consolidating
Data Cleaning Steps
Parsing
- Locates and identifies individual data elements in the source attributes and then isolates these data elements in the targets
- Examples (a minimal sketch follows this slide)
  Parsing into first, middle, and last name
  Parsing into street number and street name
  Parsing into zip code and city
Correcting
- Corrects parsed individual values using data algorithms and secondary data sources
- Example
  Correct an address by adding a zip code
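A minimal sketch of the parsing step, splitting an address line into street number and street name (the regular expression is illustrative; real cleaning tools use much richer address grammars):

```python
import re

def parse_street(address_line):
    # Optional leading digits are the street number; the rest is the name.
    match = re.match(r"^\s*(\d+)?\s*(.+?)\s*$", address_line)
    return match.group(1), match.group(2)

print(parse_street("123 Campus Drive"))  # ('123', 'Campus Drive')
print(parse_street("SAS Campus Drive"))  # (None, 'SAS Campus Drive')
```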
Data Cleaning Steps
Standardizing
- Applies conversion rules to transform data into its preferred and consistent format
- Example: replacing an acronym, replacing an abbreviation
Matching
- Searches for and matches records within and across the parsed, corrected, and standardized database, based on predefined detection rules, to identify duplicates
- Example: identifying similar names and addresses (see the sketch below)
Consolidating
- Analyzes and identifies relationships between matched records and merges them into one representation
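A minimal sketch of the matching step, using a simple string-similarity score as the predefined detection rule (the threshold and records are illustrative; dedicated tools use far more sophisticated matching):

```python
from difflib import SequenceMatcher

def similarity(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

existing = ["Mark Carver", "Mark W. Craver"]
incoming = "Mark Craver"

for name in existing:
    score = similarity(incoming, name)
    if score > 0.8:  # predefined detection rule
        # Matched records would then be consolidated into one representation.
        print(f"possible duplicate: {name!r} (score {score:.2f})")
```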
Data Cleaning Example
Operational systems – three inconsistent records for the same person:
- Mark Carver, SAS, SAS Campus Drive, Cary, N.C.
- Mark W. Craver, Systems Engineer, SAS, SAS Campus Drive, Cary, N.C. 27513, Mark.Craver@sas.com
- Mark Craver, Systems Engineer, SAS
Data warehouse – one cleaned, consolidated record:
- Mark W. Craver, Systems Engineer, SAS, SAS Campus Drive, Cary, N.C. 27513, Mark.Craver@sas.com