Introduction to ETL
SESSION- 10
ETL – Extract, Transform, Load
ETL is the process of extracting data from various sources, transforming that data to meet your requirements, and then loading it into a target data warehouse.
What is ETL?
OLTP systems cannot be used for analytical purposes; a separate system, called OLAP, is needed.
There is a source (the OLTP system) and there is a target (the OLAP system); how do we transfer the data from one system to the other?
The set of methods used to do this is the ETL process.
Now that there are 2 systems, why not directly copy the data from one system to the other? That would be an easy process.
But the answer is that the users will not be interested in all of the data recorded by the transactional system. OLAP is a system utilized for strategy and decision making.
What is ETL?
A DWH is subject oriented.
An OLTP system will contain data about the employees of an organization as well as data about its customers.
But the organization would need two different OLAP systems based on these subjects, i.e. employees and customers; hence the data has to be identified from the source system and only the relevant data should be loaded.
[Diagram: the organization's source system feeds two subject-oriented OLAP systems - one OLAP system for the EMPLOYEE subject and one for the CUSTOMER subject.]
What is ETL?
One more issue is that some of the data in the OLTP system might not be in the proper format.
For example, a phone number may have only 3 digits instead of 10, or an email address may be malformed or blank; such data needs to be filtered out (a small filtering sketch follows the diagram below).
Each OLTP system might also use a different format. For example, banks might be based on mainframes, but employee or CRM data can be in SAP ERP systems.
So, prior to loading, the homogeneity of the data should be maintained.
[Diagram: multiple SOURCE systems feed the ETL process, which loads the DATA WAREHOUSE.]
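A minimal Python sketch of such a cleansing filter, assuming hypothetical phone and email fields and very simple format rules, might look like this:

    import re

    def is_valid(record):
        """Simple cleansing rules: a 10-digit phone number and a non-blank, well-formed email."""
        phone_ok = re.fullmatch(r"\d{10}", record.get("phone", "")) is not None
        email_ok = re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", record.get("email", "")) is not None
        return phone_ok and email_ok

    source_rows = [
        {"phone": "9876543210", "email": "jane@example.com"},  # clean, kept
        {"phone": "123",        "email": "jane@example.com"},  # 3-digit phone, filtered out
        {"phone": "9876543210", "email": ""},                  # blank email, filtered out
    ]
    clean_rows = [row for row in source_rows if is_valid(row)]
    print(clean_rows)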
ETL Challenges
The complexity of the ETL design depends on the complexity of the warehouse, the number of OLTP systems that data has to be extracted from and their complexity, and the quality of the data present in the OLTP systems.
Most of the DWH loading takes place at night, so that the end users can work on report generation during the day.
A proper time slot has to be decided, and if there are multiple applications running on the server and the server capacity is small, the loads have to be synchronized so that the load on the server is not too high.
The next issue is the incremental load. Today's data is already loaded into the DWH, so there is no point in loading the same data tomorrow; only the changes have to be loaded. This is also a challenging task.
Data duplication is another concern. There is no point in loading the same data twice, as it will take up more space in the DWH without serving any purpose.
Then we come to the load process itself: millions of rows are loaded into the DWH, and the load process can take hours to complete.
Summary of ETL
Determine the data you need and the data you don’t need in your target
Determine the source(s) from which the above data is going to come
Determine the data extraction methods, cleansing rules and transformation rules
Load the dimension data and the fact data
Data Extraction
There are multiple disparate sources on the OLTP side.
Most of the systems, especially in the financial services domain, are legacy systems such as mainframes that were designed a long time ago; to keep up with the times, the company may also have newer systems like CRM or SAP.
Connecting directly to mainframe systems is a challenging task.
Hence, in a production environment, data is usually extracted within the mainframe system and sent as files.
These files are then cleansed and loaded.
Data Extraction
The type of data extraction depends on how the data gets stored in the OLTP system.
For example, consider the address (only the state is considered for simplicity) of a person, which changes over time:
Person ID 12345
From Jan 1st 2000 to May 1st 2000 – Lived in Canada
From May 2nd 2000 to June 5th 2001 – Lived in New York
From June 5th 2001 till now – Living in London
Person ID 12346
From May 4th 2002 to Sept 4th 2002 – California
From Sept 4th 2002 till now – San Francisco
Data Extraction – Approach 1
Approach 1 is to store only the most current information for each person ID.
As can be seen in the snapshots below, at any given point in time there is just one record per person ID.
At the end of every single day, the data is extracted and loaded into the warehouse.
Snapshot 1:
PERSON ID   DATE          STATE
12345       JAN 1 2000    CANADA
12346       MAY 4 2002    CALIFORNIA

Snapshot 2:
PERSON ID   DATE          STATE
12345       MAY 2 2000    NEW YORK
12346       SEPT 4 2002   SAN FRANCISCO

Snapshot 3:
PERSON ID   DATE          STATE
12345       JUNE 5 2001   LONDON
12346       SEPT 4 2002   SAN FRANCISCO
Data Extraction – Approach 2
Approach 2 would be to store the entire history.
For this approach, all changes since the time of the last extract are extracted. This is called Change Data Capture.
There is usually a timestamp associated with each change that tells us when the change was performed.
For this to work, logic has to be added at the sources, which places an additional burden on the source systems.
Hence, usually the entire data set is extracted and sent, and during the loading process it is compared with the already existing data to check whether it is the same (a comparison sketch follows the table below).
If it is the same, it is discarded; if there is a change, it is entered into the OLAP system.
PERSON ID   DATE          STATE
12345       Jan 1 2000    Canada
12345       May 2 2000    New York
12345       June 5 2001   London
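A minimal Python sketch of this compare-and-discard step, using in-memory dictionaries as stand-ins for the existing OLAP data and the incoming full extract, might look like this:

    # Latest state already held in the OLAP system, keyed by person ID (hypothetical values).
    existing = {
        12345: ("June 5 2001", "London"),
        12346: ("Sept 4 2002", "San Francisco"),
    }

    # Full extract received from the source system.
    incoming = [
        (12345, "June 5 2001", "London"),       # unchanged -> discarded
        (12346, "Jan 10 2003", "Los Angeles"),  # changed   -> new history row
    ]

    new_history_rows = []
    for person_id, date, state in incoming:
        if existing.get(person_id) == (date, state):
            continue  # same as the data already loaded, so discard it
        new_history_rows.append((person_id, date, state))
        existing[person_id] = (date, state)

    print(new_history_rows)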
Data Extraction
Usually, jobs are scheduled during off-peak hours to capture the data from the OLTP systems and prepare a set of files.
Further jobs are scheduled to deliver the files to a particular directory on a particular server (on which the ETL tool is installed).
The ETL process can be made file dependent or time dependent, i.e. it can kick off as soon as the complete set of files arrives in the directory, or it can kick off at a predetermined time (a sketch of the file-dependent check follows below).
But if it is scheduled to kick off at a predetermined time, say 3 AM, and the source system is down at that time and has not sent the files, then the ETL process will run empty and may even fail.
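A minimal Python sketch of a file-dependent kick-off check, with a hypothetical landing directory and file set, might look like this:

    import os

    LANDING_DIR = "/data/landing"                                      # hypothetical directory
    EXPECTED_FILES = {"employees.csv", "customers.csv", "sales.csv"}   # hypothetical file set

    def ready_to_run():
        """File-dependent trigger: start the ETL only when the complete set of files has arrived."""
        arrived = set(os.listdir(LANDING_DIR)) if os.path.isdir(LANDING_DIR) else set()
        return EXPECTED_FILES.issubset(arrived)

    if ready_to_run():
        print("All source files present - kicking off ETL")
    else:
        # A purely time-dependent schedule (say 3 AM) would run empty or fail here,
        # so it is safer to wait or raise an alert instead.
        print("Source files missing - ETL not started")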
Data Transformation
Data type conversion is also done in the transformation stage.
There will be mismatches between the data types of the sources and those of the DWH.
For example, the Name field might have a length of 100 characters, or multiple columns may have been joined to form one column of length 50. A CHAR column might get converted into a VARCHAR (a small conversion sketch follows the diagram below).
[Diagram: a source date in DD/MM/YY format is converted to the DD/MM/YYYY format used in the warehouse.]
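A minimal Python sketch of two such conversions, assuming a hypothetical 50-character target column and the DD/MM/YY to DD/MM/YYYY change shown above, might look like this:

    from datetime import datetime

    def to_varchar(value, max_len=50):
        """Cast any source value to text and trim it to the target column width."""
        return str(value).strip()[:max_len]

    def widen_date(value):
        """Convert a source DD/MM/YY date into the warehouse DD/MM/YYYY format."""
        return datetime.strptime(value, "%d/%m/%y").strftime("%d/%m/%Y")

    print(to_varchar("  Jane Elizabeth Doe  "))   # -> 'Jane Elizabeth Doe'
    print(widen_date("05/06/01"))                 # -> '05/06/2001'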
Data Transformation
Joining is one more example of a transformation.
Multiple sources may be used to load one particular table of the warehouse.
One particular column may be split into multiple columns, or multiple columns may be joined to form one column (see the sketch below).
Example: the first name and last name may be joined to form one column called NAME, or the NAME column may be split into first name and last name. An address may contain country, state, city and zip code together, but at the warehouse a business user might want to generate a report based only on the zip code value. Hence the address field needs to be read from the source, split into individual fields and populated.
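A minimal Python sketch of joining and splitting columns, assuming a hypothetical comma-separated address field, might look like this:

    def join_name(first_name, last_name):
        """Join two source columns into a single NAME column."""
        return f"{first_name} {last_name}"

    def split_address(address):
        """Split a combined 'country, state, city, zip' field into individual columns."""
        country, state, city, zip_code = [part.strip() for part in address.split(",")]
        return {"country": country, "state": state, "city": city, "zip": zip_code}

    print(join_name("Jane", "Doe"))                               # -> 'Jane Doe'
    print(split_address("USA, California, San Francisco, 94105"))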
Data Transformation
Calculations or aggregations are performed on the source data.
For example, the sales data of a supermarket chain may be calculated at a monthly level from the individual transactions in the source.
The number of policies an agent has sold per state per month can be calculated. Profit percentage on a quarterly basis may need to be calculated and stored in the fact tables. All such calculations may be performed in the transformation stage (a small aggregation sketch follows the diagram below).
[Diagram: raw sales data is rolled up into agent-wise sales data to show whether profit is increasing or not.]
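A minimal Python sketch of such an aggregation, using hypothetical transaction-level rows of (month, agent, amount), might look like this:

    from collections import defaultdict

    # Hypothetical transaction-level sales rows: (month, agent, amount)
    transactions = [
        ("2000-01", "A1", 120.0),
        ("2000-01", "A1", 80.0),
        ("2000-01", "A2", 50.0),
        ("2000-02", "A1", 200.0),
    ]

    monthly_sales = defaultdict(float)
    for month, agent, amount in transactions:
        monthly_sales[(month, agent)] += amount  # roll individual transactions up to agent per month

    for (month, agent), total in sorted(monthly_sales.items()):
        print(month, agent, total)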
Data Transformation
Business rules are applied to the data.
For example, there might be a rule saying that one department should have only one manager. We need checks to make sure that if two managers arrive for one particular department, we populate the right one as per the rule.
Parent-child relationships also need to be maintained. For example, a record cannot be loaded into table B as long as it is not loaded into table A, and an update record cannot go in if the corresponding insert record is not present in the table (see the sketch below).
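A minimal Python sketch of both checks, using hypothetical department and order data, might look like this:

    # Rule: one department has exactly one manager.  Incoming rows: (department, manager).
    incoming = [("Sales", "Alice"), ("Sales", "Bob"), ("Finance", "Carol")]
    managers, rejected = {}, []
    for dept, manager in incoming:
        if dept in managers and managers[dept] != manager:
            rejected.append((dept, manager))  # second manager for the same department, rejected
        else:
            managers[dept] = manager

    # Parent-child rule: a child row may load only if its parent key is already in table A.
    table_a_keys = {"ORD-1", "ORD-2"}
    table_b_rows = [("ORD-1", "line 1"), ("ORD-9", "line 1")]
    loadable = [row for row in table_b_rows if row[0] in table_a_keys]

    print(managers, rejected, loadable)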
Data Transformation
Decoding of certain values from the source is performed.
For example, a company may use the codes 1, 2, 3, 4 in its source system to classify its employees as Active, Inactive, Retired and Resigned. What do these codes mean?
A business user who is generating reports may not know the meaning of these codes. The codes need to be read from the source, re-coded and populated such that they are easy for the business user to understand (see the sketch below).
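A minimal Python sketch of the decoding step, assuming the 1-4 status codes mentioned above, might look like this:

    STATUS_DECODE = {1: "Active", 2: "Inactive", 3: "Retired", 4: "Resigned"}

    source_rows = [("E100", 1), ("E101", 3), ("E102", 4)]   # hypothetical (employee_id, status_code) rows
    decoded = [(emp_id, STATUS_DECODE.get(code, "Unknown")) for emp_id, code in source_rows]
    print(decoded)   # -> [('E100', 'Active'), ('E101', 'Retired'), ('E102', 'Resigned')]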
Key values might need to be generated for the dimension tables.
Every row in a dimension table needs to be uniquely identified, and this key is used in the fact table to identify the dimension to which it is linked.
Hence keys may need to be generated in sequence, and this can be done in the transformation stage.
First the dimension table is read, the maximum key value for that dimension is obtained, and that maximum value is incremented by 1 for every new row (see the sketch below).
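A minimal Python sketch of sequential key generation, using an in-memory dictionary as a stand-in for the dimension table, might look like this:

    # Existing dimension rows: surrogate key -> natural key (hypothetical stand-in for the table)
    dimension = {1: "CUST-A", 2: "CUST-B"}

    def assign_keys(new_natural_keys, dimension):
        """Read the current maximum surrogate key and hand out max+1, max+2, ... to new rows."""
        next_key = max(dimension, default=0) + 1
        assigned = {}
        for natural_key in new_natural_keys:
            assigned[next_key] = natural_key
            next_key += 1
        return assigned

    print(assign_keys(["CUST-C", "CUST-D"], dimension))   # -> {3: 'CUST-C', 4: 'CUST-D'}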
Data Transformation
Unit conversion may need to be performed.
If the company operates in different geographies, the sources in those geographies have their own metric systems, but reporting might be done at the headquarters, which may use another set of metrics.
Hence the metrics may need to be converted to the ones used at the headquarters.
Another commonly converted parameter is the date. Different countries use different date notations; for example, the US format is MM/DD/YYYY while the UK format is DD/MM/YYYY. These may need to be standardized to a common format (see the sketch below).
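A minimal Python sketch of both conversions, assuming miles-to-kilometres as the metric change and the US and UK date formats mentioned above, might look like this:

    from datetime import datetime

    def miles_to_km(miles):
        """Convert a source distance in miles to the kilometres used at headquarters."""
        return miles * 1.609344

    def standardize_date(value, source_format):
        """Convert a US (MM/DD/YYYY) or UK (DD/MM/YYYY) date to a common ISO format."""
        return datetime.strptime(value, source_format).strftime("%Y-%m-%d")

    print(miles_to_km(100))                             # -> 160.9344
    print(standardize_date("07/04/2001", "%m/%d/%Y"))   # US source -> '2001-07-04'
    print(standardize_date("04/07/2001", "%d/%m/%Y"))   # UK source -> '2001-07-04'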
Data Loading
There are two ways of reading data from the source systems: read everything on every run, or read only the incremental changes.
As is obvious, the second method is more efficient, so let us now look at how an incremental load can be done in a warehouse. Consider the example:

CUSTOMER ID   DATE            PRODUCT
1             1st Jan 2000    A4 Sheets
1             1st Jan 2000    Pen
2             1st Jan 2000    Pencils
3             2nd Jan 2000    Stapler
1             2nd Jan 2000    Notepad
4             3rd Jan 2000    Markers

We have a table that details what product was sold to which customer of a stationery shop.
On 1st Jan there were 3 entries made, 2 entries on 2nd Jan and 1 entry on 3rd Jan. We can use a query directly in the source system to fetch all records for a particular date.
But what will happen if there were no records for a particular date?
Data Loading
For the 3rd Jan run, we would fetch all records with date > 2nd Jan, but what if the 2nd Jan run itself fails?
As per the above logic, we would lose the 2 records that were entered on 2nd Jan.
Thus, a separate table is created to store the run status; let us call this control table BATCH_RUN_STATUS. The table will look something like this:

BATCH RUN ID   DATE            STATUS
1              1st Jan 2000    S
2              2nd Jan 2000    F
3              3rd Jan 2000    D

The last column in the above table gives the status of each run.
The 1st Jan 2000 run was a success, hence S is entered; the 2nd Jan 2000 run failed, hence F is entered; the 3rd Jan batch is due to run, hence D is entered.
For the 3rd Jan batch, all records with a date greater than that of the previous SUCCESSFUL run are fetched. In the example above, all records dated after 1st Jan 2000 are fetched and loaded (a small sketch follows below).
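A minimal Python sketch of this control-table-driven fetch, using in-memory lists as stand-ins for BATCH_RUN_STATUS and the source table, might look like this:

    # Control table BATCH_RUN_STATUS: (batch_run_id, run_date, status) with S=success, F=failed, D=due
    control_table = [
        (1, "2000-01-01", "S"),
        (2, "2000-01-02", "F"),
        (3, "2000-01-03", "D"),
    ]

    # Source rows from the stationery shop: (customer_id, sale_date, product)
    source_rows = [
        (1, "2000-01-01", "A4 Sheets"),
        (1, "2000-01-01", "Pen"),
        (2, "2000-01-01", "Pencils"),
        (3, "2000-01-02", "Stapler"),
        (1, "2000-01-02", "Notepad"),
        (4, "2000-01-03", "Markers"),
    ]

    # Use the date of the last SUCCESSFUL run, not simply the previous run.
    last_success = max(run_date for _, run_date, status in control_table if status == "S")

    # Fetch everything newer than that date, so the rows from the failed 2nd Jan run are not lost.
    to_load = [row for row in source_rows if row[1] > last_success]
    print(to_load)   # rows from 2nd Jan and 3rd Jan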