CS614 (Final)
ETL (Extract, Transform, Load) is a simple process used in Data Warehousing (DWH) to collect and prepare
data for analysis. ETL consists of the following three processes:
Extraction
Transformation
Loading
Extraction:
Data extraction is the process of getting data from different sources (Extract).
Types of data extraction:
Data extraction is divided into two types:
Logical extraction
Physical extraction
Logical extraction:
Logical extraction means taking data from the source system in a planned or smart way, using a query or rules.
Example:
Getting only sales data from the last 3 months using a SQL query.
Types of logical extraction:
Logical extraction is further divided into two types:
Full extraction
Incremental extraction
Full extraction:
Full Extraction means getting all the data from the source every time, no matter what has changed.
Example:
Every night, the entire sales table is copied into the data warehouse, even if only a few records changed.
Incremental extraction:
Incremental Extraction means getting only the new or changed data since the last extraction.
Example:
Only the sales records added or updated today are extracted and loaded into the data warehouse.
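A minimal Python sketch of incremental extraction, assuming a hypothetical sales table with a last_updated column and a saved timestamp of the last run (names are illustrative, not from the course):

import sqlite3

# In-memory source table just for the demo (column names are assumptions).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER, amount REAL, last_updated TEXT)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                 [(1, 500.0, "2025-06-17 09:00:00"),
                  (2, 750.0, "2025-06-19 14:30:00")])

def incremental_extract(conn, last_run):
    # Pull only rows added or changed since the previous extraction run.
    cur = conn.execute("SELECT id, amount, last_updated FROM sales "
                       "WHERE last_updated > ?", (last_run,))
    return cur.fetchall()

print(incremental_extract(conn, "2025-06-18 00:00:00"))   # only the newer row (id 2)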
Physical extraction:
Physical extraction means taking data directly from the source system without much filtering or logic; often the
entire data set is copied.
Example:
Copying the whole sales table from a database without any filter.
Types of physical extraction:
Physical extraction can be further divided into the following types:
Online extraction
Offline extraction
Online extraction:
Online Extraction means getting data directly from a live source system (like a running database) in real time or
near real time.
Example:
Getting sales data directly from the store's database while it is running.
Offline extraction:
Offline Extraction means getting data from a backup file or export file, not directly from the live system.
Example:
Copying data from a CSV or Excel file that was exported from the system earlier.
Transformation:
Data Transformation means changing the data into the right format, structure, or values so it can be stored and
used properly in the Data Warehouse.
Example:
Changing date format from DD-MM-YYYY to YYYY-MM-DD
Replacing "F" and "M" with "Female" and "Male"
Combining first name and last name into full name
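A small Python sketch of these three transformations applied to one record (the field names are assumptions):

from datetime import datetime

record = {"first_name": "Ali", "last_name": "Raza", "gender": "M", "dob": "19-06-1995"}

# Date format: DD-MM-YYYY -> YYYY-MM-DD
record["dob"] = datetime.strptime(record["dob"], "%d-%m-%Y").strftime("%Y-%m-%d")

# Code replacement: F/M -> Female/Male
record["gender"] = {"F": "Female", "M": "Male"}[record["gender"]]

# Joining: first name + last name -> full name
record["full_name"] = record["first_name"] + " " + record["last_name"]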
Data transformation basic tasks:
The following are some basic tasks used in data transformation:
Selection
Splitting/Joining
Conversion
Summarization
Enrichment
Selection:
Picking only the needed data from a large dataset.
Example:
Choosing only Name and Email columns from a customer table (ignoring the rest).
Splitting/Joining:
Splitting: Breaking one column into multiple parts.
Joining: Combining data from two or more columns or tables.
Examples:
Splitting: Breaking “Full Name” into “First Name” and “Last Name”.
Joining: Combining customer data with order data to see what each person bought.
Conversion:
Changing data format or type to make it uniform.
Example:
Converting a text date like “19-06-2025” to a proper date format.
Changing numbers stored as text into actual numeric data.
Summarization:
Creating summary data from detailed records.
Example:
Calculating total sales per month
Finding the average score of students
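A quick Python sketch of summarization, rolling detailed sales rows up into monthly totals (the sample rows are made up):

from collections import defaultdict

sales = [("2025-05-03", 1200), ("2025-05-19", 800), ("2025-06-01", 1500)]

totals = defaultdict(int)
for date, amount in sales:
    month = date[:7]            # e.g. "2025-05"
    totals[month] += amount     # accumulate detailed rows into a summary

print(dict(totals))             # {'2025-05': 2000, '2025-06': 1500}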
Enrichment:
Adding more useful information to the data.
Example:
Adding a city name based on a zip code
Adding a customer loyalty level based on purchase history
Loading:
Data Loading is the process of moving the transformed (cleaned and formatted) data into the Data Warehouse so
it can be stored and used for reporting and analysis.
Once data is extracted and transformed, the final step is loading it into the DWH using different strategies
depending on:
How fresh the data needs to be,
How often the data changes,
How much storage and speed you have.
Data Loading Techniques:
There are 3 main data loading techniques:
Full data refresh
Incremental Data refresh
Trickle/continuous feed
Full data refresh:
Replace everything in the table with the latest full data.
Also called Block Insert or Block Slamming
Used mostly during initial loading or when the data changes completely
Needs more time and storage
Example:
Re-loading the entire customer database, even if only 10% of it changed.
Incremental Data refresh:
Load only the new or changed data into the existing table.
Faster and more efficient than full refresh
Uses less storage
Common for daily or hourly updates
Example:
Only loading sales records added or updated since yesterday.
Trickle/continuous feed:
Load data continuously (in real-time or near real-time), as it arrives.
Very fresh data available at all times
Uses row-level inserts/updates
High system load and complexity
Example:
A bank loading each transaction instantly after it happens.
CDC:
CDC stands for Change Data Capture, a technique used in ETL that quickly identifies and processes only the
data that has changed (not the entire table) and makes the changed data available for further use.
Web scraping:
Web Scraping is the process of automatically collecting information from websites. It’s like a robot reading a
website and picking out the useful data, such as prices, names, news, or reviews.
Data cleansing:
Data Cleansing (or Scrubbing) is the process of detecting and correcting dirty data before loading it into the
Data Warehouse.
Dirty data:
Dirty Data means incorrect, incomplete, or duplicate data stored in a database or data warehouse that can cause
problems in reports, analysis, or decisions.
Example:
A person born in 1995 but hired in 1985 → Invalid data
Same customer listed twice with slightly different names → Duplicate
Importance of data cleansing:
Data cleansing improves the quality of data in the warehouse.
It ensures correct business decisions based on clean and valid data.
Dirty data leads to bad reports, wasted resources, and wrong conclusions.
Challenges in Data Cleansing:
Most real-world data is dirty (GIGO – Garbage In, Garbage Out).
Cleansing needs domain experts to define rules, but it must be semi-automatic because of the large volume of data.
Dirty data is relative: what is incorrect in one system may be valid in another.
Data anomalies:
Data anomalies are errors, inconsistencies, or unexpected problems in the data that make it incorrect,
incomplete, or misleading.
Types of data anomalies:
Data anomalies can be divided into the following three types:
Syntactically dirty data
Semantically dirty data
Coverage anomalies
Syntactically dirty data:
These are format or structure errors in the data (how the data looks).
Types of syntactically dirty data:
Lexical error
Irregularities
Lexical error:
Data does not match the expected structure or format.
Example:
A row has only 4 columns, but the table expects 5 columns.
Wrong: Ali, 25, Lahore, Male
Correct: Ali, 25, Lahore, Male, Engineer
Irregularities:
Non-uniform use of values, units, or abbreviations.
Example:
Salary is written as 20,000 in one place (assumed to be Rs.)
and $20,000 in another place (US dollars).
Semantically dirty data:
These are logical or meaning-related errors in data.
Types of semantically dirty data:
Integrity constraint violation
Business rule contradiction
Duplication
Integrity constraint violation:
Data breaks rules set by the database (like unique ID, valid date).
Example:
A person is born in 1990 but hired in 1985.
Business rule contradiction:
Data goes against business logic or company rules.
Example:
A student has grade A but 0% attendance – this breaks school rules.
Duplication:
Same record appears more than once, maybe with small differences.
Example:
Ali Raza, 0345-1234567
Alee Raza, 0345-1234567
Might be same person but entered twice.
Coverage anomalies:
These happen when data is missing or incomplete.
Types of coverage anomalies:
Missing attributes
Missing records
Missing attributes:
Some columns are empty or NULL.
Example:
Ali, , Lahore, Male
(Missing age)
Missing records:
Whole rows are missing — a person or item is not entered at all.
Example:
A customer bought a product, but no entry was made in the sales table.
Handling missing data:
Dropping Records:
Remove rows with missing values (not always ideal).
Manual Filling:
Manually add missing values using expert knowledge.
Global Constant:
Use a default value like "N/A" or "0".
Mean/Median:
Fill missing values with the column's average or middle value.
Most Probable Value:
Use prediction or history to fill missing values.
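A small Python sketch of the mean-fill strategy for a numeric column (plain lists here; real pipelines usually use a library such as pandas):

ages = [25, None, 30, None, 35]

known = [a for a in ages if a is not None]
mean_age = sum(known) / len(known)              # average of the non-missing values

filled = [a if a is not None else mean_age for a in ages]
print(filled)                                   # [25, 30.0, 30, 30.0, 35]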
Key-based classification problems:
These are data errors related to primary keys (PK) or non-primary keys (NPK) in a database.
They affect how data is uniquely identified and connected.
Types:
There are 2 main types of key-based classification problems:
Primary key (PK) problems
Non-Primary key (NPK) problems
Primary key (PK) problems:
Primary Key is a unique ID for each record (like student ID, customer number, NID).
Common primary key problems:
Same PK, different data
Same entity with two different PKs
PK in one system but missing in another
Same PK but different format
Non-Primary key (NPK) problems:
These are errors in other fields like name, address, gender, etc.
Common non-primary key problems:
Same data written in different formats
Same thing written in different ways
Fake or meaningless values entered
One field name used for different data.
Important fields are left blank or incomplete.
Summary:
A typical primary key problem is the same PK with different data, e.g., ID 101 assigned to two different people.
Steps of data cleansing: (ESVMHD)
We can break down data cleansing into six steps:
Elementizing
Standardizing
Verifying
Matching
Householding
Documenting
Automatic data cleansing techniques:
Statistical
Pattern based
Clustering
Association rule
Statistical Method:
Use math (mean, standard deviation) to find outliers (data values that are very different from the rest of the
values in a dataset).
Example:
If most salaries are between 40k–80k, a 5 million salary is likely an error.
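A minimal Python sketch of the statistical method; the 2-standard-deviation cutoff is just a common rule of thumb, not a fixed value:

from statistics import mean, stdev

salaries = [40000, 45000, 52000, 58000, 61000, 67000, 72000, 78000, 5000000]

mu, sigma = mean(salaries), stdev(salaries)
outliers = [s for s in salaries if abs(s - mu) > 2 * sigma]   # values far from the mean
print(outliers)                                               # [5000000]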
Pattern-Based:
Use pattern rules to find records that don’t match normal behavior.
Example:
If 90% of customers have age 20–50, one with age 300 is a likely error.
Clustering:
Group similar records, find outliers far from all groups.
Drawback: Slow with large data.
Association Rules:
Find rules like “If age < 10, job title ≠ Manager”
Breakers of these rules are marked as errors.
Data Duplication:
Data duplication occurs when multiple records in a dataset refer to the same real-world entity (e.g., a person or
household), but appear in different formats or have slight variations.
Reason for data duplication:
Data comes from multiple sources with different formats.
Typing mistakes, abbreviations, or fraudulent entries.
People write names or addresses differently.
Same person might have multiple IDs or employee numbers in different systems.
Problem caused by data duplication:
Wrong analytics – double counting skews results.
Annoyed customers – multiple marketing calls to the same person.
Fraud – bonus given to same employee multiple times using different IDs.
Wasted resources – extra printing, emails, or analysis efforts
BSN Method:
BSN (Basic Sorted Neighborhood) Method is an algorithm used to detect and eliminate duplicate records in
large datasets. It concatenates the data into one sequential list of N records.
Steps of BSN method:
Create key
Sort Records
Merge (Sliding window)
Create key:
Extract key fields (e.g., first 3 letters of name + street number + NID portion) to create a unique identifier for
each record.
Example:
For “Muhammad Ahmad, 440 Munir Road, NID: 34535322”
→ Key: AHM440MUN345
Sort Records:
Sort all records by the generated key so that similar records come close together.
Merge (Sliding window):
Use a window of size w to scan the sorted list:
Compare each new record to previous w-1 records only.
If similar → consider them duplicates.
Without the window: All pairs (n² comparisons).
With window: Only w comparisons per record → Faster.
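A simplified Python sketch of the three BSN steps, reusing the key recipe from the example above; the window size and the similarity rule (same NID) are only placeholders:

def make_key(r):
    # e.g. "Muhammad Ahmad", "440 Munir Road", "34535322" -> "AHM440MUN345"
    last = r["name"].split()[-1]
    street = r["street"].split()
    return (last[:3] + street[0] + street[1][:3] + r["nid"][:3]).upper()

def bsn(records, w=4, similar=None):
    similar = similar or (lambda a, b: a["nid"] == b["nid"])   # placeholder matching rule
    records = sorted(records, key=make_key)                    # step 2: sort by the key
    pairs = []
    for i, rec in enumerate(records):
        # step 3: compare only with the previous w-1 records (sliding window)
        for prev in records[max(0, i - (w - 1)):i]:
            if similar(prev, rec):
                pairs.append((prev, rec))
    return pairs

recs = [{"name": "Muhammad Ahmad", "street": "440 Munir Road", "nid": "34535322"},
        {"name": "M. Ahmad",       "street": "440 Munir Rd",   "nid": "34535322"}]
print(bsn(recs))   # the two records land in the same window and are flagged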
Problem in BSN method:
Dirty Keys:
If input data is dirty → key will also be dirty.
Abbreviations or typos reduce matching accuracy.
Solution: Use standard external sources to validate and clean.
Tokenization:
If names/addresses are in different orders → mismatch.
Solution: Break fields into tokens (words) and sort the tokens within each field.
Example:
"420 Street 15 Chaklala" vs "Street 15, No. 420" → Tokenized sorting helps match.
BSN equational theory (Matching logic):
BSN doesn't rely only on string matching; it uses logic-based matching based on domain equivalence, which
requires a declarative rule language.
Distance Functions for Matching:
Used to calculate how “close” two strings are:
Edit Distance: No. of changes (insert, delete, substitute) to convert one string to another.
Phonetic Distance: How similar the pronunciation is.
Typewriter Distance: Errors from adjacent keys on a keyboard.
Example:
"Ahmad" vs "Ahamd" → Edit distance = 1
BSN Limitations:
One key is not enough to catch all duplicates.
Order of fields in key matters – the first field has more power.
BSN improvement:
BSN can be improved by using multi-pass BSN or by increasing the window size (w); a larger window increases
accuracy but also cost.
Multi-Pass BSN:
Run BSN multiple times with different keys each time:
1st run: sort by name.
2nd run: sort by address.
3rd run: sort by NID.
Then apply transitive closure:
If A ≈ B and B ≈ C → then A ≈ C.
This increases chances of finding duplicates without needing large w.
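A small Python sketch of applying transitive closure to the duplicate pairs found across several BSN passes, using a union-find structure (an implementation choice, not something prescribed by the course):

def transitive_closure(pairs):
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path compression
            x = parent[x]
        return x

    for a, b in pairs:                      # pairs from all BSN passes
        parent[find(a)] = find(b)           # union: A ≈ B

    groups = {}
    for x in list(parent):
        groups.setdefault(find(x), set()).add(x)
    return list(groups.values())

# Pass 1 found (A, B), pass 2 found (B, C) -> one group {A, B, C}
print(transitive_closure([("A", "B"), ("B", "C")]))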
Data:
All data is a representation of something real. Data has value only when it is used for something useful.
Quality:
Quality means how good something is compared to other options. Philip Crosby defines quality as:
“Quality is conformance to requirements”. It means the product or data should meet the user’s needs or
expectations, whether written or not. If data meets written specs but doesn’t satisfy the user, it’s not high quality.
Data Quality:
Data Quality means the data is correct, complete, useful, and up-to-date.
Good quality data helps in making better decisions and running the business smoothly. There are two significant
definitions of data quality:
Intrinsic Data Quality:
Means the data is correct and accurate.
It matches the real-world fact it represents.
Example:
If your name and address are spelled exactly right in the database.
Realistic Data Quality:
Means data is useful for the business.
Even correct data is useless if it doesn't help business work properly.
Example:
A correct phone number is useless if no one needs to call the customer.
Intelligent Learning Organization:
A smart organization that shares good-quality data, learns from experience, and improves together to grow
successfully.
Dysfunctional Learning Organization:
A poor organization that uses bad data, hides information, doesn’t learn from mistakes, and does not work
properly. The term dysfunctional means “impaired or abnormal functioning.”
Min-Max technique:
Min-Max technique is used to combine multiple data quality scores into one final score.
Min (Minimum):
The Min (Minimum) operator gives a conservative score: it takes the lowest value. To be safe and conservative,
we consider the weakest point of the data quality.
Max (Maximum):
The Max (Maximum) operator gives a liberal score: it takes the highest value.
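A tiny Python illustration of combining scores with Min and Max (the dimension names and scores are made up):

scores = {"completeness": 0.90, "accuracy": 0.70, "timeliness": 0.95}

conservative = min(scores.values())   # Min operator -> 0.70 (weakest dimension)
liberal      = max(scores.values())   # Max operator -> 0.95 (best dimension)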
Data Quality Validation:
Data Quality Validation means checking the data to make sure it is correct, complete, and useful.
Common Techniques:
Referential Integrity (RI)
Attribute Domain Validation
Using Data Quality Rules
Data Histograming
Referential Integrity (RI):
RI ensures that relationships between tables are correct.
Example:
If one table refers to another table (like customer_ID in Payments table), then the value must exist in the related
table (Customer table). If the Payments table has a customer_ID that does not exist in the Customer table, it's
called an orphan record.
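A quick Python sketch of an RI check for orphan records (the table contents are illustrative):

customers = {1, 2, 3}                          # customer_IDs in the Customer table
payments  = [(101, 1), (102, 2), (103, 7)]     # (payment_ID, customer_ID)

orphans = [p for p in payments if p[1] not in customers]
print(orphans)   # [(103, 7)] -> customer_ID 7 does not exist in Customer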
Attribute Domain Validation:
Checks if the values in a column are from a valid set of values (domain).
Example:
In a Gender column, valid values should be: Male, Female, Other.
If you find Mle or Fem, it's invalid.
Using Data Quality Rules:
Rules are set based on business logic to define what is “correct” data.
Example:
Phone numbers should have 11 digits.
Date of Birth should not be in the future.
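A small Python sketch of checking these two rules on one record (the field names phone and dob are assumptions):

from datetime import date

def check_record(rec):
    errors = []
    if not (rec["phone"].isdigit() and len(rec["phone"]) == 11):
        errors.append("phone must have 11 digits")
    if rec["dob"] > date.today():
        errors.append("date of birth is in the future")
    return errors

print(check_record({"phone": "03451234567", "dob": date(1995, 6, 19)}))   # [] -> record passes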
Data Histograming:
This technique uses graphs or charts to see how data is distributed. It helps to spot unusual or fake data.
TDQM:
TDQM (Total Data Quality Management) means managing and improving data quality continuously, not just
one time.
TDQM Cycle:
TDQM is a cycle that keeps running to ensure that data stays clean, correct, and useful. Following are the steps
used in TDQM:
Establish data quality environment
Plan the data quality project
Implement the Project (D-M-A-I Cycle)
Evaluate project
Establish data quality environment:
Build a teamwork environment where both Data users (business side) and System developers (technical side)
work together to solve data quality problems.
Example:
Users know what data errors exist (e.g., wrong prices), while developers know how to fix systems. They must
collaborate to make lasting improvements.
Indexing:
Indexing is a way to speed up searching in a database. It works like an index at the back of a book – it helps the
database find data faster without checking every row. An index is purely a physical database concept and has
nothing to do with the logical model.
Need for indexing:
Speed
Query Complexity
I/O bottleneck
Types of indexing:
Basic types
Dense
Sparse
Multi-level (or B-Tree)
Primary index
Secondary index
Dense:
In a dense index, every record in the database table has an entry in the index.
Example:
Roll No Name
101 Ali
102 Sara
103 Ayesha
Advantages:
Fast lookups (no need to search further).
Accurate and simple.
Disadvantages:
Needs more memory (larger index size).
Slower when inserting/deleting (more updates needed).
Sparse:
In a sparse index, only some records are indexed (usually one per block or page).
Example:
Roll No Name
101 Ali
102 Sara
103 Ayesha
Only Roll No 101 (the first record of the block) gets an index entry. To find 102, the system finds the closest
smaller indexed key (101) and scans forward.
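A rough Python sketch of a sparse-index lookup, assuming one index entry per block (the block layout is made up):

import bisect

sparse_index = [(101, 0), (201, 1), (301, 2)]          # one (key, block_no) entry per block
blocks = {0: [101, 102, 103], 1: [201, 202], 2: [301, 302]}

def lookup(roll_no):
    keys = [k for k, _ in sparse_index]
    pos = bisect.bisect_right(keys, roll_no) - 1       # closest smaller (or equal) key
    if pos < 0:
        return False
    _, block_no = sparse_index[pos]
    return roll_no in blocks[block_no]                 # scan forward inside the block

print(lookup(102))   # True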
Advantages:
Requires less memory.
Faster to update on insert/delete.
Disadvantages:
Slightly slower than dense index for lookup (needs extra scan).
Multi-level indexing (B-Tree):
When the index becomes too large, we use multiple levels of indexing to improve performance. The most
common type is B-tree index.
Example:
Think of a phonebook:
First level: A–D, E–H, I–L, M–P, Q–Z
Second level: specific names under each section
This helps jump faster by narrowing search levels.
Advantages:
Fast search even for huge tables.
Balanced structure ensures consistent performance.
Disadvantages:
Complex structure.
Slightly higher storage and maintenance cost.
Primary indexing:
A primary index is built on the primary key or any ordered unique field (e.g., Roll No). It is usually a dense
index.
Example:
Roll No is the primary key. Index is built directly on it.
Advantages:
Very fast and accurate searches.
Automatically updated if primary key is changed.
Disadvantages:
Only one primary index per table.
If primary key is changed, index must be updated.
Secondary Index:
A secondary index is built on non-primary (non-unique) columns. It allows quick search on fields other than the
primary key.
Example:
An inverted index built on the Text column:
Doc ID  Text
1       Data is valuable
2       Data warehousing is useful
3       Knowledge is power
Word         Document IDs
data         1, 2
is           1, 2, 3
valuable     1
warehousing  2
useful       2
knowledge    3
power        3
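A minimal Python sketch of building such an inverted index from the three documents above:

docs = {1: "Data is valuable",
        2: "Data warehousing is useful",
        3: "Knowledge is power"}

inverted = {}
for doc_id, text in docs.items():
    for word in text.lower().split():
        inverted.setdefault(word, set()).add(doc_id)   # word -> set of document IDs

print(sorted(inverted["data"]))   # [1, 2]
print(sorted(inverted["is"]))     # [1, 2, 3]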
Advantages:
Great for full-text search.
Fast lookup for keyword-based queries.
Very low storage space
Low level bit operation
Counts and joins
Disadvantages:
Not suitable for numeric or range-based queries.
Requires preprocessing and regular updates.
Locking of many rows
Low cardinality
Keyword parsing
Difficult to maintain – needs reorganization when relation size changes (new bitmaps)
Bitmap index:
A Bitmap Index uses bit arrays (0s and 1s) to represent the presence of a value for a given attribute across
records. Best for low-cardinality columns (few unique values).
Example:
Suppose we have a Gender column in a table:
ID Gender
1 M
2 F
3 M
4 F
5 M
Let M = 1 and F = 0. The bitmap index will be:
10101
(Equivalently, the bitmap for M is 10101 and the bitmap for F is 01010.)
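A small Python sketch of per-value bitmaps and an AND filter, using integer bit operations (the second City column is added only to illustrate combining conditions):

rows = [("M", "Lahore"), ("F", "Karachi"), ("M", "Lahore"), ("F", "Lahore"), ("M", "Karachi")]

def bitmap(values, wanted):
    bits = 0
    for i, v in enumerate(values):
        if v == wanted:
            bits |= 1 << i                             # set bit i when row i holds the value
    return bits

gender_m = bitmap([r[0] for r in rows], "M")           # bits 0, 2, 4 -> 0b10101
city_lhr = bitmap([r[1] for r in rows], "Lahore")      # bits 0, 2, 3 -> 0b01101

both = gender_m & city_lhr                             # AND: males in Lahore
print([i for i in range(len(rows)) if (both >> i) & 1])   # [0, 2]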
Advantages:
Very efficient for filtering and combining conditions (AND/OR).
Uses less space for low-cardinality columns.
Disadvantages:
Not good for columns with many unique values (like names or IDs).
Bitmaps can grow large for high-cardinality.
RLE (Run Length Encoding):
RLE is a compression technique used in bitmap indexing to reduce storage size by storing runs of repeated
values.
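A short Python sketch of run-length encoding a bitmap string (the (bit, run-length) output format is just one possible encoding):

from itertools import groupby

def rle(bits):
    # "111100011" -> [('1', 4), ('0', 3), ('1', 2)]
    return [(bit, len(list(run))) for bit, run in groupby(bits)]

print(rle("1111000000000011"))   # [('1', 4), ('0', 10), ('1', 2)]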
Clustered index:
A Clustered Index determines the physical order of data in the table. There can be only one clustered index per
table.
Example:
Suppose we create a clustered index on StudentID:
Student_ID(clustered) Name
101 Ali
102 Ayesha
103 Sara
Advantages:
Fast range queries and sorting.
Data is directly accessible in order.
Disadvantages:
Insert/update becomes slower (due to maintaining order).
Only one per table (since it affects row storage).
Join index:
A Join Index stores the result of a JOIN operation so that we don’t have to recompute the join every time. Used
in data warehouses where large joins are common.
Advantages:
Speeds up frequent join queries.
Avoids recomputation, saves processing time.
Disadvantages:
Takes extra storage.
Must be updated if base tables change.
Data Mining:
Data mining is the process of discovering useful patterns and knowledge from large datasets using methods
from statistics, machine learning, and database systems.
Types of Data Mining:
Supervised Data Mining
Unsupervised Data Mining
Supervised Data Mining:
Supervised data mining uses labeled training data to learn a function or rule that maps inputs to known outputs.
Example:
Predicting if an email is “spam” or “not spam” (label = spam or not spam).
Predicting house prices based on size and location.
Unsupervised Data Mining:
Unsupervised data mining analyzes input data without labeled responses to find hidden patterns, groupings, or
structures.
Example:
Grouping customers into segments based on behavior (clustering)
Finding items that are often bought together (association rules)
K-Means Clustering:
K-Means Clustering is a way to group similar data points into clusters. You tell the algorithm how many clusters
(K) you want, and it tries to put similar items into the same group. It’s called “K-means” because K is the
number of clusters, and means refers to the average (center) of each cluster.
Or
K-Means is an unsupervised machine learning algorithm that partitions a dataset into K clusters, where each
data point belongs to the cluster with the nearest mean (centroid).
Cluster:
A cluster is a collection of data points aggregated together because of certain similarities or patterns. In
clustering algorithms, each cluster represents a subset of the data that shares common characteristics.
Working:
Choose K (the number of clusters you want).
Place K points randomly as the initial cluster centers (called centroids).
Assign each data point to the nearest cluster (centroid).
Calculate the new centroids (mean of all points in each cluster).
Repeat steps 3–4 until the centroids don’t change much.
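A compact Python sketch of these steps for one-dimensional points (in practice a library such as scikit-learn is normally used):

import random

def kmeans(points, k, iters=20):
    centroids = random.sample(points, k)                   # step 2: random initial centres
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                                   # step 3: assign to nearest centroid
            nearest = min(range(k), key=lambda c: abs(p - centroids[c]))
            clusters[nearest].append(p)
        for c in range(k):                                 # step 4: recompute the means
            if clusters[c]:
                centroids[c] = sum(clusters[c]) / len(clusters[c])
    return centroids, clusters

centroids, clusters = kmeans([1, 2, 3, 10, 11, 12], k=2)
print(centroids)   # roughly 2.0 and 11.0, one centre per group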
Limitations:
It only works for data sets where there is a concept of mean.
We have to specify the number of clusters (K) in advance.
It is not a robust technique, as it does not work well in the presence of noise.
The clusters found by K-means have to be convex.
DWH Lifecycle:
The Data Warehouse (DWH) Lifecycle is the step-by-step process used to design, develop, implement, operate,
and improve a data warehouse system. It involves methodologies, strategies, and techniques that guide how a
DWH is built and evolved over time.
Implementation strategies:
Top-down approach
Bottom-up approach
Top-down approach:
A top-down approach is generally useful when the technology is mature and well understood, and when the
business problem that must be solved is clear and well understood:
Build the complete DWH first, then data marts.
Suitable when business goals and technology are clear.
Advantage: Structured and scalable.
Disadvantage: Expensive and slow.
Bottom-up approach:
A bottom-up approach is useful when the business objectives that are to be met by the data warehouse are
unclear.
Start with small, specific data marts.
Integrate them later into a full DWH.
Advantage: Quick results, lower cost.
Disadvantage: Integration challenges.
Development methodologies:
Development methodologies describe the expected evolution and management of the engineering system.
Waterfall model
Spiral Model
RAD