Cs614 (Final)

ETL (Extract, Transform, Load) is a process in data warehousing that involves extracting data from various sources, transforming it into a suitable format, and loading it into a data warehouse for analysis. The document details the types of extraction (logical and physical), transformation tasks, and loading techniques, along with the importance of data cleansing to ensure data quality. It also discusses data anomalies, duplication issues, and methods for improving data quality through various techniques and organizational practices.

ETL:

ETL (Extract, Transform, Load) is a simple process used in Data Warehousing (DWH) to collect and prepare
data for analysis. ETL consists of the following three processes:

 Extraction
 Transformation
 Loading
Extraction:
Data extraction is the process of getting data from different sources (Extract).
Types of data extraction:
Data extraction is divided into two types:
 Logical extraction
 Physical extraction
Logical extraction:
Logical Extraction is when data is taken in a planned or smart way using a query or rules from the source
system.
Example:
Getting only sales data from the last 3 months using a SQL query.
Types of logical extraction:
Logical extraction is further divided into two types:
 Full extraction
 Incremental extraction
Full extraction:
Full Extraction means getting all the data from the source every time, no matter what has changed.
Example:
Every night, the entire sales table is copied into the data warehouse, even if only a few records changed.
Incremental extraction:
Incremental Extraction means getting only the new or changed data since the last extraction.
Example:
Only the sales records added or updated today are extracted and loaded into the data warehouse.
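A minimal Python sketch of incremental extraction using an in-memory SQLite table (the table, columns, and watermark value are made up for illustration):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER, amount REAL, updated_at TEXT)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    (1, 500.0, "2025-06-18"),
    (2, 750.0, "2025-06-19"),
    (3, 120.0, "2025-06-19"),
])

last_extracted = "2025-06-18"   # watermark saved by the previous ETL run (assumption)
changed_rows = conn.execute(
    "SELECT id, amount, updated_at FROM sales WHERE updated_at > ?",
    (last_extracted,),
).fetchall()
print(changed_rows)             # only the rows added/updated after the watermark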
Physical extraction:
Physical Extraction is when data is taken directly from the source system without much filtering or logic —
often the entire data is copied.
Example:
Copying the whole sales table from a database without any filter.
Types of physical extraction:
Physical extraction can be further divided into the following types:
 Online extraction
 Offline extraction
Online extraction:
Online Extraction means getting data directly from a live source system (like a running database) in real time or
near real time.
Example:
Getting sales data directly from the store's database while it is running.
Offline extraction:
Offline Extraction means getting data from a backup file or export file, not directly from the live system.
Example:
Copying data from a CSV or Excel file that was exported from the system earlier.
Transformation:
Data Transformation means changing the data into the right format, structure, or values so it can be stored and
used properly in the Data Warehouse.
Example:
 Changing date format from DD-MM-YYYY to YYYY-MM-DD
 Replacing "F" and "M" with "Female" and "Male"
 Combining first name and last name into full name
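A minimal Python sketch of the three example transformations above (the record values are made up):

from datetime import datetime

record = {"first_name": "Ali", "last_name": "Raza", "gender": "M", "join_date": "19-06-2025"}
transformed = {
    # combine first name and last name into a full name
    "full_name": record["first_name"] + " " + record["last_name"],
    # replace "M"/"F" codes with full words
    "gender": {"M": "Male", "F": "Female"}.get(record["gender"], "Unknown"),
    # change the date format from DD-MM-YYYY to YYYY-MM-DD
    "join_date": datetime.strptime(record["join_date"], "%d-%m-%Y").strftime("%Y-%m-%d"),
}
print(transformed)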
Data transformation basic tasks:
Following are some basic tasks that are used in data transformation:
 Selection
 Splitting/Joining
 Conversion
 Summarization
 Enrichment
Selection:
Picking only the needed data from a large dataset.
Example:
Choosing only Name and Email columns from a customer table (ignoring the rest).
Splitting/Joining:
Splitting: Breaking one column into multiple parts.
Joining: Combining data from two or more columns or tables.
Examples:
Splitting: Breaking “Full Name” into “First Name” and “Last Name”.
Joining: Combining customer data with order data to see what each person bought.
Conversion:
Changing data format or type to make it uniform.
Example:
 Converting a text date like “19-06-2025” to a proper date format.
 Changing numbers stored as text into actual numeric data.
Summarization:
Creating summary data from detailed records.
Example:
 Calculating total sales per month
 Finding the average score of students
Enrichment:
Adding more useful information to the data.
Example:
 Adding a city name based on a zip code
 Adding a customer loyalty level based on purchase history
Loading:
Data Loading is the process of moving the transformed (cleaned and formatted) data into the Data Warehouse so
it can be stored and used for reporting and analysis.
Once data is extracted and transformed, the final step is loading it into the DWH using different strategies
depending on:
 How fresh the data needs to be,
 How often the data changes,
 How much storage and speed you have.
Data Loading Techniques:
There are 3 main data loading techniques:
 Full data refresh
 Incremental Data refresh
 Trickle/continuous feed
Full data refresh:
Replace everything in the table with the latest full data.
 Also called Block Insert or Block Slamming
 Used mostly during initial loading or when the data changes completely
 Needs more time and storage
Example:
Re-loading the entire customer database, even if only 10% of it changed.
Incremental Data refresh:
Load only the new or changed data into the existing table.
 Faster and more efficient than full refresh
 Uses less storage
 Common for daily or hourly updates
Example:
Only loading sales records added or updated since yesterday.
Trickle/continuous feed:
Load data continuously (in real-time or near real-time), as it arrives.
 Very fresh data available at all times
 Uses row-level inserts/updates
 High system load and complexity
Example:
A bank loading each transaction instantly after it happens.
CDC:
CDC stands for Change Data Capture, a technique used in ETL that quickly identifies and processes only the data that has changed, rather than the entire table, and makes the changed data available for further use.
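A minimal Python sketch of timestamp-based CDC (the rows and the saved capture time are made up; real CDC tools often read the database log instead):

last_capture = "2025-06-18 00:00:00"     # time of the previous capture run (assumption)
source_rows = [
    {"id": 1, "amount": 500, "updated_at": "2025-06-17 09:00:00"},
    {"id": 2, "amount": 750, "updated_at": "2025-06-19 14:30:00"},
]
# keep only rows changed since the last capture and hand them to the next ETL step
changed = [r for r in source_rows if r["updated_at"] > last_capture]
print(changed)                           # only row 2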

Finals
Web scraping:
Web Scraping is the process of automatically collecting information from websites. It’s like a robot reading a
website and picking out the useful data, such as prices, names, news, or reviews.
Data cleansing:
Data Cleansing (or Scrubbing) is the process of detecting and correcting dirty data before loading it into the
Data Warehouse.
Dirty data:
Dirty Data means incorrect, incomplete, or duplicate data stored in a database or data warehouse that can cause
problems in reports, analysis, or decisions.
Example:
 A person born in 1995 but hired in 1985 → Invalid data
 Same customer listed twice with slightly different names → Duplicate
Importance of data cleansing:
 Data cleansing improves the quality of data in the warehouse.
 It ensures correct business decisions based on clean and valid data.
 Dirty data leads to bad reports, wasted resources, and wrong conclusions.
Challenges in Data Cleansing
 Most real-world data is dirty (GIGO – Garbage In, Garbage Out).
 Needs domain experts for rules but must be semi-automatic due to large data.
 Dirty is relative: What’s incorrect in one system may be valid in another.
Data anomalies:
Data anomalies are errors, inconsistencies, or unexpected problems in the data that make it incorrect,
incomplete, or misleading.
Types of data anomalies:
Data anomalies can be divided into the following three types:
 Syntactically dirty data
 Semantically dirty data
 Coverage anomalies
Syntactically dirty data:
These are format or structure errors in the data (how the data looks).
Types of syntactically dirty data:
 Lexical error
 Irregularities
Lexical error:
Data does not match the expected structure or format.
Example:
A row has only 4 columns, but the table expects 5 columns.
Wrong: Ali, 25, Lahore, Male
Correct: Ali, 25, Lahore, Male, Engineer
Irregularities:
Non-uniform use of values, units, or abbreviations.
Example:
Salary is written as 20,000 in one place (assumed to be Rs.)
and $20,000 in another place (US dollars).
Semantically dirty data:
These are logical or meaning-related errors in data.
Types of semantically dirty data:
 Integrity constraint violation
 Business rule contradiction
 Duplication
Integrity constraint violation:
Data breaks rules set by the database (like unique ID, valid date).
Example:
A person is born in 1990 but hired in 1985.
Business rule contradiction:
Data goes against business logic or company rules.
Example:
A student has grade A but 0% attendance – this breaks school rules.
Duplication:
Same record appears more than once, maybe with small differences.
Example:
 Ali Raza, 0345-1234567
 Alee Raza, 0345-1234567
Might be same person but entered twice.
Coverage anomalies:
These happen when data is missing or incomplete.
Types of coverage anomalies:
 Missing attributes
 Missing records
Missing attributes:
Some columns are empty or NULL.
Example:
Ali, , Lahore, Male
(Missing age)
Missing records:
Whole rows are missing — a person or item is not entered at all.
Example:
A customer bought a product, but no entry was made in the sales table.
Handling missing data:
 Dropping Records:
Remove rows with missing values (not always ideal).
 Manual Filling:
Manually add missing values using expert knowledge.
 Global Constant:
Use a default value like "N/A" or "0".
 Mean/Median:
Fill missing values with the column's average or middle value (see the sketch after this list).
 Most Probable Value:
Use prediction or history to fill missing values.
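A minimal Python sketch of the mean-fill strategy (the records are made up):

rows = [
    {"name": "Ali", "age": 25},
    {"name": "Sara", "age": None},     # missing value
    {"name": "Ayesha", "age": 31},
]
known = [r["age"] for r in rows if r["age"] is not None]
mean_age = sum(known) / len(known)     # average of the values that are present
for r in rows:
    if r["age"] is None:
        r["age"] = mean_age            # fill the gap with the column mean
print(rows)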
Key-based classification problems:
These are data errors related to primary keys (PK) or non-primary keys (NPK) in a database.
They affect how data is uniquely identified and connected.
Types:
There are 2 main types of key-based classification problems:
 Primary key (PK) problems
 Non-Primary key (NPK) problems
Primary key (PK) problems:
Primary Key is a unique ID for each record (like student ID, customer number, NID).
Common primary key problems:
 Same PK, different data
 Same entity with two different PKs
 PK in one system but missing in another
 Same PK but different format
Non-Primary key (NPK) problems:
These are errors in other fields like name, address, gender, etc.
Common non-primary key problems:
 Same data written in different format
 Same thing written in different way
 Fake or meaningless values entered
 One field name used for different data.
 Important fields are left blank or incomplete.
Summary:

Type          Problem                         Example
Primary Key   Same PK, different data         ID 101 for two people
Primary Key   Same person, different PKs      Ahmed Khan = ID 201 & 350
Primary Key   PK missing in one system        No customer ID in support system
Primary Key   Same PK, different format       123456789 vs. 123-456-789
Non-PK        Different encoding              Gender as M/F or 1/0
Non-PK        Multiple representations        FAST, FAST-NU
Non-PK        Invalid data                    999-999-9999
Non-PK        Same name, different meaning    "Total Sales" confusion
Non-PK        Null or missing fields          Blank email or contact

Steps of data cleaning: (ESVMHD)
We can break down the data cleansing into six steps:
 Elementizing
 Standardizing
 Verifying
 Matching
 Householding
 Documenting
Automatic data cleaning techniques:
 Statistical
 Pattern based
 Clustering
 Association rule
Statistical Method:
Use math (mean, standard deviation) to find outliers (data values that are very different from the rest of the values in a dataset).
Example:
If most salaries are between 40k–80k, a 5 million salary is likely an error.
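A minimal Python sketch of this idea (the salary list is made up; the 2-standard-deviation cut-off is an assumption):

import statistics

salaries = [40_000, 45_000, 48_000, 52_000, 55_000, 58_000, 60_000, 65_000, 72_000, 5_000_000]
mean = statistics.mean(salaries)
stdev = statistics.stdev(salaries)
# flag values that lie far away from the rest as likely errors
outliers = [s for s in salaries if abs(s - mean) > 2 * stdev]
print(outliers)   # [5000000]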
Pattern-Based:
Use pattern rules to find records that don’t match normal behavior.
Example:
If 90% of customers have age 20–50, one with age 300 is a likely error.
Clustering:
Group similar records, find outliers far from all groups.
Drawback: Slow with large data.
Association Rules:
Find rules like “If age < 10, job title ≠ Manager”
Breakers of these rules are marked as errors.
Data Duplication:
Data duplication occurs when multiple records in a dataset refer to the same real-world entity (e.g., a person or
household), but appear in different formats or have slight variations.
Reason for data duplication:
 Data comes from multiple sources with different formats.
 Typing mistakes, abbreviations, or fraudulent entries.
 People write names or addresses differently.
 Same person might have multiple IDs or employee numbers in different systems.
Problem caused by data duplication:
 Wrong analytics – double counting skews results.
 Annoyed customers – multiple marketing calls to the same person.
 Fraud – bonus given to same employee multiple times using different IDs.
 Wasted resources – extra printing, emails, or analysis efforts
BSN Method:
BSN (Basic Sorted Neighborhood) Method is an algorithm used to detect and eliminate duplicate records in large datasets. It concatenates the data into one sequential list of N records.
Steps of BSN method:
 Create key
 Sort Records
 Merge (Sliding window)
Create key:
Extract key fields (e.g., first 3 letters of name + street number + NID portion) to create a unique identifier for
each record.
Example:
For “Muhammad Ahmad, 440 Munir Road, NID: 34535322”
→ Key: AHM440MUN345
Sort Records:
Sort all records by the generated key so that similar records come close together.
Merge (Sliding window):
Use a window of size w to scan the sorted list:
 Compare each new record to previous w-1 records only.
 If similar → consider them duplicates.
Without the window: All pairs (n² comparisons).
With window: Only w comparisons per record → Faster.
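A minimal Python sketch of the three BSN steps (the records, the key recipe, and the exact-key similarity test are simplifications for illustration):

records = [
    {"name": "Muhammad Ahmad", "street_no": "440", "nid": "34535322"},
    {"name": "Muhammed Ahmad", "street_no": "440", "nid": "34535322"},
    {"name": "Sara Khan",      "street_no": "12",  "nid": "99887766"},
]

def make_key(r):
    # step 1: first 3 letters of the last name + street number + first 3 NID digits
    return r["name"].split()[-1][:3].upper() + r["street_no"] + r["nid"][:3]

records.sort(key=make_key)          # step 2: sort so similar records sit together

w = 3                               # window size
duplicates = []
for i, rec in enumerate(records):   # step 3: compare only with the previous w-1 records
    for prev in records[max(0, i - w + 1):i]:
        if make_key(prev) == make_key(rec):   # crude similarity test (assumption)
            duplicates.append((prev["name"], rec["name"]))
print(duplicates)   # the two "Ahmad" records are flagged as likely duplicates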
Problem in BSN method:
Dirty Keys:
 If input data is dirty → key will also be dirty.
 Abbreviations or typos reduce matching accuracy.
Solution: Use standard external sources to validate and clean.
Tokenization:
 If names/addresses are in different orders → mismatch.
 Break fields into tokens (words) and sort within.
Example:
"420 Street 15 Chaklala" vs "Street 15, No. 420" → Tokenized sorting helps match.
BSN equational theory (Matching logic):
BSN doesn't rely only on string matching; it uses logic-based matching based on domain equivalence, which requires a declarative rule language.
Distance Functions for Matching
Used to calculate how “close” two strings are:
Edit Distance: No. of changes (insert, delete, substitute) to convert one string to another.
Phonetic Distance: How similar the pronunciation is.
Typewriter Distance: Errors from adjacent keys on a keyboard.
Example:
"Ahmad" vs "Ahamd" → Edit distance = 1
BSN Limitations:
 One key is not enough to catch all duplicates.
 Order of fields in key matters – the first field has more power.
BSN improvement:
BSN can be improved by using multi-pass BSN or by increasing the window size (w), which increases accuracy but also cost.
Multi-Pass BSN
Run BSN multiple times with different keys each time:
 1st run: sort by name.
 2nd run: sort by address.
 3rd run: sort by NID.
Then apply transitive closure:
 If A ≈ B and B ≈ C → then A ≈ C.
This increases chances of finding duplicates without needing large w.
Data:
All data is a representation of something real. Data has value only when it is used for something useful
Quality:
Quality means how good something is compared to other options. Quality by Philip Crosby is defined as:
“Quality is conformance to requirements”. It means the product or data should meet the user’s needs or
expectations, whether written or not. If data meets written specs but doesn’t satisfy the user, it’s not high quality.
Data Quality:
Data Quality means the data is correct, complete, useful, and up-to-date.
Good quality data helps in making better decisions and running the business smoothly. There are two significant definitions of data quality:
Intrinsic Data Quality
 Means the data is correct and accurate.
 It matches the real-world fact it represents.
Example:
If your name and address are spelled exactly right in the database.
Realistic Data Quality
 Means data is useful for the business.
 Even correct data is useless if it doesn't help business work properly.
Example:
A correct phone number is useless if no one needs to call the customer.
Intelligent Learning Organization:
A smart organization that shares good-quality data, learns from experience, and improves together to grow
successfully.
Dysfunctional Learning Organization:
A poor organization that uses bad data, hides information, doesn’t learn from mistakes, and does not work
properly. The term dysfunctional means “impaired or abnormal functioning.”

Orr’s Laws of Data Quality


 Data that is not used cannot be correct.
If no one has used it, don't assume it's correct.
 Data quality depends on how it’s used, not how it’s collected.
Using data reveals its quality.
 Data is only as good as its toughest use.
Data that works for simple tasks may fail in detailed tasks.
 Old systems have more data problems.
Data degrades over time (missing context, outdated tech).
 Rare problems cause big damage.
One-off issues (e.g., one wrong NID) can cause serious harm.
Data quality management (DQM):
Data Quality Management is the process of checking, improving, and maintaining the quality of data.
Its goal is to make sure that the data is accurate, complete, and useful for the business.
Total Quality Management (TQM):
TQM means that everyone in a company works together continuously to improve quality in everything they do,
with the main goal of satisfying the customer.
Main feature of TQM:
 Involvement of everyone
 Customer-Oriented Approach
 Continuous Improvement
Involvement of everyone:
All employees — from top to bottom — help in solving problems and improving the quality.
Customer-Oriented Approach:
The main goal is to make customers happy by giving them better service or products.
Continuous Improvement:
TQM is not a one-time fix. It’s a daily process of checking and improving.
TQM Components:
TQM has the following two components:
 Quality assurance
 Quality improvement
Quality assurance:
Making sure that the product follows the standards (rules set earlier).
Quality improvement:
Trying to make products or services better and faster for the customer.
Law of Diminishing Returns:
This means that spending too much to improve quality gives less and less benefit.
Cost of Data Quality Defect:
These are extra costs a company faces because of bad or low-quality data. It includes the money spent to fix
errors, missed opportunities, and expenses on tools and training.
Types of Costs Due to Bad Data:
 Controllable cost
 Resultant cost
 Equipment and training cost
Controllable cost:
Costs that can be planned, managed, and controlled — like checking, fixing, and preventing data errors.
Example:
 Hiring staff to clean incorrect data.
 Creating rules to stop errors from happening again.
 Running weekly checks on customer records.
Resultant cost:
Losses or problems caused when bad data leads to bad business decisions or missed chances.
Example:
 Sending payment to a wrong or retired employee (ghost employee).
 Losing a customer because of incorrect contact details.
Equipment and training cost:
Money spent on tools, software, and training needed to improve data quality.
Example:
 Buying software for data cleaning.
 Training staff to enter correct data.
Characteristics (Dimensions) of Data Quality:

Characteristic Meaning (Simple Definition)


Accuracy Data is correct and has no errors.
Completeness No missing values where data is required.
Consistency Data follows defined rules or standards.
Timeliness Data is current and up-to-date.
Uniqueness No duplicate records.
Interpretability Data is clear, in the right format, with proper labels/units.
Accessibility Data is easy to access and retrieve.
Objectivity Data is neutral, not biased.

Data Quality Assessment:


Data Quality Assessment is the process of checking how good or bad the data is. It helps find problems like
errors, missing values, duplicates, or outdated data.
Techniques to Measure Data Quality:
 Simple ratio
 Min-Max
 Believability
 Appropriate amount of data
Simple ratio:
This method uses ratios (like percentages) to measure the quality of data. The simple ratio measures the ratio of
desired outcomes to total outcomes.
Example:
Let’s say 90 records are correct out of 100 total records.
Then the simple ratio = 90 / 100 = 0.9 (or 90% accurate)
Uses:
Simple ratio is used for following:
 Free of error
 Completeness
 Consistency
Free of error:
It shows how many records are without mistakes. Following is the formula for free of error:
Free-of-error = 1 - (Number of Errors / Total Records)
Example:
Out of 100 records, 5 have mistakes.
So, Free-of-error = (100 - 5) / 100 = 0.95 (or 95%)
Completeness:
It checks if all required values or data entries are present.
Types:
Schema Completeness: Are all fields/columns included?
Column Completeness: Are any values missing in a column?
Population Completeness: Are all expected values (like all 36 Punjab districts) present?
Example:
If only 30 out of 36 districts are listed, completeness = 30/36 = 0.83 (or 83%)
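A quick Python calculation of the two ratios above (numbers taken from the examples):

total_records = 100
error_records = 5
free_of_error = 1 - (error_records / total_records)   # 0.95, i.e. 95% error-free
districts_present = 30
population_completeness = districts_present / 36      # about 0.83, i.e. 83% complete
print(free_of_error, round(population_completeness, 2))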
Consistency:
It checks whether the data matches across different systems or tables.
Example:
If a customer's name is “Ali Khan” in one table and “A. Khan” in another, it's inconsistent.

Min-Max technique:
Min-Max technique is used to combine multiple data quality scores into one final score.
Min (Minimum):
The Min (Minimum) operator gives a conservative score by taking the lowest value: to be safe, we rate the data quality by its weakest point.
Max (Maximum):
Max (Maximum) operator gives a liberal score — it takes the highest value.
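A minimal Python sketch of combining dimension scores (the scores are made up):

scores = {"accuracy": 0.95, "completeness": 0.83, "consistency": 0.90}
conservative = min(scores.values())   # Min operator: overall quality = weakest dimension
liberal = max(scores.values())        # Max operator: overall quality = strongest dimension
print(conservative, liberal)          # 0.83 and 0.95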
Data Quality Validation:
Data Quality Validation means checking the data to make sure it is correct, complete, and useful.
Common Techniques:
 Referential Integrity (RI)
 Attribute Domain Validation
 Using Data Quality Rules
 Data Histograming
Referential Integrity (RI):
RI ensures that relationships between tables are correct.
Example:
If one table refers to another table (like customer_ID in Payments table), then the value must exist in the related
table (Customer table). If the Payments table has a customer_ID that does not exist in the Customer table, it's
called an orphan record.
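A minimal Python sketch of an orphan-record check (the IDs are made up):

customers = {10, 11, 12}                                   # customer_IDs in the Customer table
payments = [{"payment_id": 1, "customer_id": 10},
            {"payment_id": 2, "customer_id": 99}]          # 99 has no matching customer
orphans = [p for p in payments if p["customer_id"] not in customers]
print(orphans)   # payment 2 is an orphan record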
Attribute Domain Validation:
Checks if the values in a column are from a valid set of values (domain).
Example:
In a Gender column, valid values should be: Male, Female, Other.
If you find Mle or Fem, it's invalid.
Using Data Quality Rules:
Rules are set based on business logic to define what is “correct” data.
Example:
 Phone numbers should have 11 digits.
 Date of Birth should not be in the future.
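A minimal Python sketch applying these two rules plus the gender-domain check from above (the rule set and the sample record are assumptions):

from datetime import date

def validate(record):
    problems = []
    if not (record["phone"].isdigit() and len(record["phone"]) == 11):
        problems.append("phone must have 11 digits")
    if record["dob"] > date.today():
        problems.append("date of birth is in the future")
    if record["gender"] not in {"Male", "Female", "Other"}:
        problems.append("gender outside the valid domain")
    return problems

print(validate({"phone": "03451234567", "dob": date(1995, 6, 19), "gender": "Mle"}))
# ['gender outside the valid domain']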
Data Histograming:
This technique uses graphs or charts to see how data is distributed. It helps to spot unusual or fake data.
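A minimal Python sketch using value counts as a rough histogram (the ages are made up):

from collections import Counter

ages = [25, 31, 25, 47, 25, 31, 300]
print(Counter(ages))   # e.g. Counter({25: 3, 31: 2, 47: 1, 300: 1}); the value 300 stands out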

TDQM:
TDQM (Total Data Quality Management) means managing and improving data quality continuously, not just
one time.
TDQM Cycle:
TDQM is a cycle that keeps running to ensure that data stays clean, correct, and useful. Following are the steps
used in TDQM:
 Establish data quality environment
 Plan the data quality project
 Implement the Project (D-M-A-I Cycle)
 Evaluate project
Establish data quality environment:
Build a teamwork environment where both Data users (business side) and System developers (technical side)
work together to solve data quality problems.
Example:
Users know what data errors exist (e.g., wrong prices), while developers know how to fix systems. They must
collaborate to make lasting improvements.

Plan the data quality project:


Create a detailed plan for the data quality improvement project.
Plan includes:
 What is the goal of the project?
 Which data will be checked?
 Which tools will be used?
 What are the deadlines?
 How much will it cost?
Implement the Project (D-M-A-I Cycle):
This step includes four key actions:
Define:
Find out what the user needs and how to measure data quality.
Example: Every customer must have a valid email.
Measure:
Check the actual data using those rules.
Example: Count how many customers are missing email addresses.
Analyze:
Find out why the data is wrong or missing.
Example: Emails were optional in the form.
Improve:
Fix the source of the problem.
Example: Make email mandatory and apply validation rules.
Evaluate project:
After doing the project, review whether the data quality really improved or not.
The house of Quality:
The House of Quality is a technique developed in Japan (Mitsubishi) to connect customer needs with product
specifications using a matrix.
Purpose:
It reduces two major risks:
 Product features don’t match customer needs.
 Final product doesn’t meet specifications.
Data quality improvements:
The four categories of Data Quality Improvement:
 Process
 System
 Policy & Procedure
 Data Design
Process:
Fix how data is entered and used.
Example: Centralize data entry so it's more consistent.
System:
Use better tools, hardware, and training.
Example: Use faster servers so you don’t skip checks during big uploads.
Policy & Procedure:
Make rules and standard procedures.
Example: Make a rule that every DOB must be filled.
Data Design:
Improve how your database is built.
Example: Use primary keys, constraints, and indexes to avoid wrong or duplicate data.
Quality Management Maturity Grid:
This grid shows how mature an organization is in managing data quality. There are 5 stages:
1. Stage 1: Uncertainty (Denial Stage)
2. Stage 2: Awakening (Talk but no action)
3. Stage 3: Enlightenment (Start Measuring)
4. Stage 4: Wisdom (Prevention mode)
5. Stage 5: Certainty (Fully mature)
Common Misconceptions about Data Quality:
 You can fix data.
 Data quality is an IT problem
 It’s the fall of data entry
 The data ware house has the only correct data.
 Just compare with master copy
Parallelism:
Parallel execution (or parallelism) means dividing a big task into smaller tasks and running them at the same
time using multiple processors.
Uses of parallelism:
Parallelism is useful when:
 You're working with huge data (like in data warehouses).
 Tasks can be done independently (no need to wait for each other).
Best for:
 Scanning large tables
 Joining large tables
 Creating large indexes
 Inserting or deleting a lot of data
 Copying and summarizing data
Conditions for Effective Parallelism:
To benefit from parallelism, the system must have:
 Multi-processor hardware (like clusters or SMP systems)
 High I/O bandwidth (for fast data transfer)
 CPUs that are not already overloaded (e.g., <30% usage)
 Enough RAM to handle multiple parallel tasks
Note:
Parallelism can reduce system performance on over-utilized systems or systems with small I/O bandwidth.
Scalability:
Scalability means the system’s ability to handle more data, users, and complexity without slowing down.
As a data warehouse grows:
 More users query it.
 Questions become more complex.
 Data models become more detailed.
So, it’s not just about how big your database is, but how smartly it handles growth.
Types of scalability:
There are two types of scalability:
Speed up – If you add more processors, the same task should finish faster.
Scale up – if both data size and resources grow together, the task should still take the same time.
Amdahl’s Law:
Amdahl’s Law tells us how much speed we can really gain by using parallelism, depending on how much of the
task must be done sequentially.
Formula:
Speedup (S) = 1 / (f + ((1 - f)/N))
Where:
 f = fraction of task that must be done sequentially (can’t be parallelized)
 N = number of processors
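A quick worked example in Python (f and N are made-up values):

f = 0.10   # 10% of the task is strictly sequential
N = 4      # number of processors
speedup = 1 / (f + (1 - f) / N)
print(round(speedup, 2))   # about 3.08, not 4, because of the sequential part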
Types of parallel processing architectures:
 Symmetric Multiprocessing (SMP)
 Massively Parallel Processing (MPP)
 Non-Uniform Memory Access (NUMA)
Symmetric Multiprocessing (SMP):
All CPUs share the same memory and disk.
Example:
Windows or Linux running on a multi-core system.
Massively parallel processing:
Each CPU has its own memory; CPUs communicate via network.
Example:
Google data centers or high-performance clusters.
Non-Uniform Memory Access (NUMA):
Some memory is closer to one CPU and farther from another.
Example:
A CPU accesses its own memory faster than others’, like local vs remote cupboards.
Types of software architecture in parallelism:
 Shared Memory: All processors access the same memory.
 Shared Disk: All processors access the same disk.
 Shared Nothing: Each processor has its own disk and memory.
 Shared Everything: Combines all shared resources.
Distributed memory machine:
A Distributed Memory Machine is a system where each processor has its own memory, and processors talk to
each other using a network to share data.
Shared memory machines:
A Shared Memory Machine is a system where all processors share the same memory, and they can access any
part of the memory directly.
Virtual Shared memory (VSM):
Virtual Shared Memory is a system where each processor has its own memory, but software makes it look like
all processors are sharing one big memory. It acts like shared memory, even though the memory is actually
distributed across multiple machines.
Shared Disk RDBMS:
Multiple database instances access the same disk storage.
Example:
All branches of a bank access the same central customer database.
Advantage: High fault tolerance (data stays available).
Disadvantage: Locking and coordination can slow it down.
Shared Nothing RDBMS:
Each database instance has its own storage (no shared storage).
Example:
Each store has its own sales database. If one fails, others keep working.
Advantage: No lock conflicts, highly parallel.
Disadvantage: If one node fails, its data is not accessible.

Partitioning in Shared Nothing RDBMS:


Partitioning = Splitting data into parts.
Types of partitioning:
 Range partitioning
Based on value range (e.g., sales from Jan to Mar).
 Hash partitioning
Uses a formula to divide data evenly (see the sketch after this list).
 List partitioning
Based on specific values (e.g., city names).
 Round robin
Like dealing cards evenly to players.
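A minimal Python sketch of hash partitioning (CRC32 as the hash function and 4 partitions are arbitrary choices):

import zlib

def partition(key, num_partitions):
    # a stable hash so the same key always lands in the same partition
    return zlib.crc32(key.encode()) % num_partitions

for city in ["Lahore", "Karachi", "Multan", "Quetta", "Peshawar"]:
    print(city, "-> partition", partition(city, 4))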
Data Parallelism:
Data parallelism means running the same operation on different chunks (partitions) of data at the same time.
Working:
 Data is divided into partitions (static/dynamic).
 Each partition is processed by a separate process or "query server".
 A query coordinator gathers results from all processes.
Spatial Parallelism (Pipelining):
Splitting a big task into small steps and performing these steps in parallel on different data.
Working:
 Each step is called a “stage” in the pipeline.
 Once the first stage is done with data, the second starts immediately, like an assembly line.
Skew in partitioning:
Skew means uneven data distribution, which causes load imbalance during parallel execution.
Types:
Attribute Skew – Some values are more frequent.
Example: Cancellation data mostly from Q4-2001 (air travel).
Partition Skew – Some partitions have more data than others.
Example: One server gets 80% of data; others get little.

Indexing:
Indexing is a way to speed up searching in a database. It works like an index at the back of a book – it helps the
database find data faster without checking every row. An index is purely a physical database concept and has nothing to do with the logical model.
Need for indexing:
 Speed
 Query Complexity
 I/O bottleneck
Types of indexing:
 Basic types: Dense, Sparse, and Multi-level (B-Tree)
 Primary index
 Secondary index
Dense:
In a dense index, every record in the database table has an entry in the index.
Example:
Roll No Name
101 Ali
102 Sara
103 Ayesha

Dense index:

Roll No (Index) Pointer to Record


101 →Ali
102 →Sara
103 →Ayesha

Advantages:
 Fast lookups (no need to search further).
 Accurate and simple.
Disadvantages:
 Needs more memory (larger index size).
 Slower when inserting/deleting (more updates needed).
Sparse:

In a sparse index, only some records are indexed (usually one per block or page).

Example:

Roll No Name
101 Ali
102 Sara
103 Ayesha

Indexing after every second row:

Roll No (Index) Pointer to Record


101 →Ali
103 →Ayesha

To find 102, the system finds the closest smaller key (101) and scans forward.
Advantages:
 Requires less memory.
 Faster to update on insert/delete.
Disadvantages:
 Slightly slower than dense index for lookup (needs extra scan).
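A minimal Python sketch of a sparse-index lookup (the tiny table and the every-second-row index follow the example above):

from bisect import bisect_right

data = [(101, "Ali"), (102, "Sara"), (103, "Ayesha")]   # sorted records
sparse_index = [(101, 0), (103, 2)]                     # (key, position), every 2nd row

def lookup(key):
    keys = [k for k, _ in sparse_index]
    i = bisect_right(keys, key) - 1        # closest indexed key <= the target
    if i < 0:
        return None
    pos = sparse_index[i][1]
    while pos < len(data) and data[pos][0] <= key:   # scan forward from there
        if data[pos][0] == key:
            return data[pos][1]
        pos += 1
    return None

print(lookup(102))   # Sara (found by starting at 101 and scanning forward)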
Multilevel indexing (B-tree):
When the index becomes too large, we use multiple levels of indexing to improve performance. The most common type is the B-tree index.
Example:
Think of a phonebook:
 First level: A–D, E–H, I–L, M–P, Q–Z
 Second level: specific names under each section
This helps jump faster by narrowing search levels.
Advantages:
 Fast search even for huge tables.
 Balanced structure ensures consistent performance.
Disadvantages:
 Complex structure.
 Slightly higher storage and maintenance cost.
Primary indexing:
A primary index is built on the primary key or any ordered unique field (e.g., Roll No). It is usually a dense
index.
Example:
Roll No is the primary key. Index is built directly on it.
Advantages:
 Very fast and accurate searches.
 Automatically updated if primary key is changed.
Disadvantages:
 Only one primary index per table.
 If primary key is changed, index must be updated.
Secondary Index:
A secondary index is built on non-primary (non-unique) columns. It allows quick search on fields other than the
primary key.
Example:
Index built on the name column:

Name (Index) Pointer to record


Ali → 101
Ayesha → 103
Advantages:
 Allows fast search on non-key columns.
 You can have multiple secondary indexes on different columns.
Disadvantages:
 Needs extra storage.
 Slower than primary index.
 Needs updates more frequently.
Special Index Structure:
 Inverted index
 Bitmap index
 Clustered index
 Join index
Inverted index:
An Inverted Index is used mainly for text data (like documents or web pages). It maps words (or terms) to the
list of documents (or rows) in which they appear — like an index at the back of a book.
Example:
Let’s say we have 3 documents:

Doc ID Text
1 Data is valuable
2 Data warehousing is useful
3 Knowledge is power

Inverted index would look like:

Word Document Id
data 1,2
Is 1,2,3
Valuable 1
Warehousing 2
Useful 2
Knowledge 3
power 3
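A minimal Python sketch that builds the inverted index above:

docs = {
    1: "Data is valuable",
    2: "Data warehousing is useful",
    3: "Knowledge is power",
}
inverted = {}
for doc_id, text in docs.items():
    for word in text.lower().split():
        inverted.setdefault(word, set()).add(doc_id)

print(inverted["data"])                            # {1, 2}
print(inverted["data"] & inverted["warehousing"])  # documents containing both words: {2}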

Advantages:
 Great for full-text search.
 Fast lookup for keyword-based queries.
 Very low storage space
 Low level bit operation
 Counts and joins

Disadvantages:
 Not suitable for numeric or range-based queries.
 Requires preprocessing and regular updates.
 Locking of many rows
 Low cardinality
 Keyword parsing
 Difficult to maintain – needs reorganization when relation sizes change (new bitmaps)
Bitmap index:
A Bitmap Index uses bit arrays (0s and 1s) to represent the presence of a value for a given attribute across
records. Best for low-cardinality columns (few unique values).
Example:
Suppose we have a Gender column in a table:

ID Gender
1 M
2 F
3 M
4 F
5 M

Let M = 1 and F = 0. The Gender column is then encoded as the bit string:
10101
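A minimal Python sketch of bitmaps for the Gender column above and the kind of bit operations they enable:

genders = ["M", "F", "M", "F", "M"]
bitmap_m = [1 if g == "M" else 0 for g in genders]   # 1 0 1 0 1
bitmap_f = [1 if g == "F" else 0 for g in genders]   # 0 1 0 1 0

count_m = sum(bitmap_m)                              # 3 male rows, counted from the bitmap alone
either = [m | f for m, f in zip(bitmap_m, bitmap_f)] # OR of the two bitmaps (all ones here)
print(bitmap_m, bitmap_f, count_m, either)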
Advantages:
Very efficient for filtering and combining conditions (AND/OR).
Uses less space for low-cardinality columns.
Disadvantages:
Not good for columns with many unique values (like names or IDs).
Bitmaps can grow large for high-cardinality.
RLE (Run Length Encoding):
RLE is a compression technique used in bitmap indexing to reduce storage size by storing runs of repeated
values.
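A minimal Python sketch of run-length encoding a bitmap:

from itertools import groupby

bitmap = [1, 1, 1, 1, 0, 0, 0, 1, 1, 0]
# store (bit, run length) pairs instead of every individual bit
rle = [(bit, len(list(run))) for bit, run in groupby(bitmap)]
print(rle)   # [(1, 4), (0, 3), (1, 2), (0, 1)]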
Clustered index:
A Clustered Index determines the physical order of data in the table. There can be only one clustered index per
table.
Example:
Suppose we create a clustered index on StudentID:

Student_ID(clustered) Name
101 Ali
102 Ayesha
103 Sara

Advantages:
 Fast range queries and sorting.
 Data is directly accessible in order.
Disadvantages:
 Insert/update becomes slower (due to maintaining order).
 Only one per table (since it affects row storage).
Join index:
A Join Index stores the result of a JOIN operation so that we don’t have to recompute the join every time. Used
in data warehouses where large joins are common.
Advantages:
 Speeds up frequent join queries.
 Avoids recomputation, saves processing time.
Disadvantages:
 Takes extra storage.
 Must be updated if base tables change.
Data Mining:
Data mining is the process of discovering useful patterns and knowledge from large datasets using methods
from statistics, machine learning, and database systems.
Types of Data Mining:
 Supervised Data Mining
 Unsupervised Data Mining
Supervised Data Mining:
Supervised data mining uses labeled training data to learn a function or rule that maps inputs to known outputs.
Example:
 Predicting if an email is “spam” or “not spam” (label = spam or not spam).
 Predicting house prices based on size and location.
Unsupervised Data Mining:
Unsupervised data mining analyzes input data without labeled responses to find hidden patterns, groupings, or
structures.
Example:
 Grouping customers into segments based on behavior (clustering)
 Finding items that are often bought together (association rules)
K-Means Clustering:
K-Means Clustering is a way to group similar data points into clusters. You tell the algorithm how many clusters
(K) you want, and it tries to put similar items into the same group. It’s called “K-means” because K is the
number of clusters, and means refers to the average (center) of each cluster.
Or
K-Means is an unsupervised machine learning algorithm that partitions a dataset into K clusters, where each
data point belongs to the cluster with the nearest mean (centroid).
Cluster:
A cluster is a collection of data points aggregated together because of certain similarities or patterns. In
clustering algorithms, each cluster represents a subset of the data that shares common characteristics.
Working:
 Choose K (the number of clusters you want).
 Place K points randomly as the initial cluster centers (called centroids).
 Assign each data point to the nearest cluster (centroid).
 Calculate the new centroids (mean of all points in each cluster).
 Repeat steps 3–4 until the centroids don’t change much.
Limitation:
 It can only work for data sets where the concept of a mean is defined.
 We have to specify the number of clusters (K) in advance.
 It is not a robust technique, as it does not work well in the presence of noise.
 The clusters found by K-means have to be convex.
DWH Lifecycle:
The Data Warehouse (DWH) Lifecycle is the step-by-step process used to design, develop, implement, operate,
and improve a data warehouse system. It involves methodologies, strategies, and techniques that guide how a
DWH is built and evolved over time.
Implementation strategies:
 Top-down approach
 Bottom-up approach
Top-down approach:
A top-down approach is generally useful for the project when the technology is mature and well understood, as
well as where the business problem that must be solved is clear and well understood:
 Build the complete DWH first, then data marts.
 Suitable when business goals and technology are clear.
Advantage: Structured and scalable.
Disadvantage: Expensive and slow.
Bottom-up approach:
A bottom-up approach is useful when the business objectives that are to be met by the data warehouse are
unclear.
 Start with small, specific data marts.
 Integrate them later into a full DWH.
Advantage: Quick results, lower cost.
Disadvantage: Integration challenges.
Development methodologies:
Development methodologies describe the expected evolution and the management of the engineering system.
 Waterfall model
 Spiral Model
 RAD
