Experiment 07: Perform data profiling, data validation, and data cleansing. Notify outliers and anomalies present in the given dataset.
• Learning Objective: The student should be able to gain knowledge of data profiling, data validation, and data cleansing, notify anomalies, and visualize the results using matplotlib.
Tools: Python, Kaggle Datasets
Theory:
• Introduction
This experiment focuses on improving the quality and reliability of data through a complete workflow
of data profiling, data validation, data cleansing, and anomaly detection. High-quality data is critical
because any inconsistency or error can mislead analysis and reduce the accuracy of machine learning
models. Using the Iris dataset as an example, the process highlights how to examine the dataset’s
structure, identify problems, correct them, and finally detect unusual patterns or outliers.
• Data Profiling
Data profiling is the first step. It provides a comprehensive understanding of the dataset by
summarizing key characteristics such as column names, data types, missing values, descriptive
statistics, and correlations between numeric features. Techniques like generating summary statistics
and heatmaps reveal the distribution of values and relationships among variables. This stage acts like
a health check, helping identify potential issues such as unexpected data types or inconsistent ranges
before deeper analysis.
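As a brief illustration, a minimal profiling sketch (assuming the Iris measurements are already loaded into a pandas DataFrame named df, as done in the implementation below) could look like this:

import matplotlib.pyplot as plt
import seaborn as sns

# df: pandas DataFrame holding the Iris measurements (assumed loaded earlier)
df.info()                      # column names, data types, non-null counts
print(df.describe())           # descriptive statistics for numeric columns
print(df.isnull().sum())       # missing-value count per column

# Heatmap of pairwise correlations among the numeric features
sns.heatmap(df.select_dtypes("number").corr(), annot=True, cmap="coolwarm")
plt.show()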
• Data Validation
After profiling, data validation enforces logical and business rules to ensure data integrity. The
experiment checks for duplicate rows, invalid negative values in numeric columns, and other
anomalies that violate the expected schema. These validations confirm that the dataset conforms to
defined constraints, such as all measurements being non-negative. Detecting these violations early
prevents errors from propagating into later analysis or modeling stages.
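A small validation sketch under the same assumptions (DataFrame df; the rule that all measurements must be non-negative) might be:

# Duplicate rows violate the one-record-per-observation rule
print("Duplicate rows:", df.duplicated().sum())

# Negative measurements violate the non-negativity constraint
for col in df.select_dtypes("number").columns:
    n_neg = (df[col] < 0).sum()
    if n_neg:
        print(f"{col}: {n_neg} negative value(s) found")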
• Data Cleansing
Data cleansing corrects the problems uncovered during profiling and validation. Missing values are
imputed using mean values for numerical columns and the most frequent value for categorical
columns. Duplicates are removed to avoid double counting, and negative measurements are replaced
with column means to maintain realistic ranges. These steps ensure that the dataset is consistent,
complete, and ready for use in machine learning or statistical modeling.
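These repairs can be sketched as follows (DataFrame df with a categorical species column assumed; SimpleImputer comes from scikit-learn):

from sklearn.impute import SimpleImputer

num_cols = df.select_dtypes("number").columns

# Impute missing values: mean for numeric, most frequent for categorical
df[num_cols] = SimpleImputer(strategy="mean").fit_transform(df[num_cols])
df[["species"]] = SimpleImputer(strategy="most_frequent").fit_transform(df[["species"]])

# Remove duplicates, then replace negative entries with the mean of valid values
df = df.drop_duplicates()
for col in num_cols:
    df.loc[df[col] < 0, col] = df.loc[df[col] >= 0, col].mean()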
• Outlier and Anomaly Detection
Finally, the workflow identifies records that deviate significantly from the norm. Two approaches are
demonstrated: the statistical Z-score method flags points that lie more than three standard deviations
from the mean, while the Isolation Forest algorithm applies machine learning to detect unusual
patterns without relying on distributional assumptions. Highlighting these outliers is important
because they can either indicate rare but valid events or point to data entry errors that need further
investigation.
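Both approaches can be sketched briefly (df assumed as before; the contamination rate matches the implementation notes below, and the random seed is an illustrative choice):

from sklearn.ensemble import IsolationForest

num = df.select_dtypes("number")

# Z-score method: standardize each value and flag |z| > 3
z = (num - num.mean()) / num.std()
print("Z-score outlier rows:", int((z.abs() > 3).any(axis=1).sum()))

# Isolation Forest: fit_predict returns -1 for anomalies, 1 for normal rows
iso = IsolationForest(contamination=0.05, random_state=42)  # seed is illustrative
df["Anomaly"] = iso.fit_predict(num)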
Implementation (a consolidated code sketch covering these steps appears under Code & Output below):
1. Environment Setup
Imported essential Python libraries—pandas and numpy for data handling, matplotlib and
seaborn for visualization, and sklearn modules (SimpleImputer, IsolationForest, datasets)
for preprocessing and anomaly detection.
2. Data Loading
Loaded the Iris dataset using sklearn.datasets.load_iris and converted it into a Pandas
DataFrame. Added a species column for class labels. To demonstrate cleaning techniques,
deliberately introduced missing values (NaN) and invalid negative values in numeric columns.
3. Data Profiling
Executed df.info(), df.describe(), and df.isnull().sum() to get schema details,
descriptive statistics, and missing value counts. Plotted a correlation heatmap using
seaborn.heatmap to visualize relationships between numerical features.
4. Data Validation
Detected duplicates with df.duplicated().sum(). Checked for invalid negative values in each
numeric column by filtering rows where values were less than zero. This ensured all
measurements stayed within biologically meaningful ranges.
5. Data Cleansing
Applied SimpleImputer with the mean strategy for numeric columns and most-frequent strategy
for categorical columns to replace missing values. Dropped duplicate rows and replaced any
negative numeric entries with the column mean to maintain consistency.
6. Outlier & Anomaly Detection
Used two methods:
o Z-Score Method: Calculated z-scores z = (x − mean) / std and flagged values with |z| > 3 as outliers.
o Isolation Forest: Fitted an unsupervised Isolation Forest model (contamination=0.05) to
identify anomalous rows. Marked anomalies in a new Anomaly column.
7. Saving the Cleaned Dataset
Filtered out the anomalous rows (Isolation Forest labels anomalies -1, so only rows with Anomaly == 1 were retained) and exported the cleaned DataFrame to a CSV file named iris_cleaned.csv using DataFrame.to_csv().
Code & Output:
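The following consolidated sketch is assembled from the implementation steps above; the deliberately corrupted cell positions and the random seed are illustrative assumptions rather than values prescribed by the experiment:

# 1. Environment setup
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import datasets
from sklearn.ensemble import IsolationForest
from sklearn.impute import SimpleImputer

# 2. Data loading: Iris -> pandas DataFrame with a species label column
iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df["species"] = iris.target_names[iris.target]

# Deliberately introduce missing and invalid values (cells chosen arbitrarily)
df.loc[5, "sepal length (cm)"] = np.nan
df.loc[12, "petal width (cm)"] = np.nan
df.loc[20, "sepal width (cm)"] = -3.0

# 3. Data profiling
df.info()
print(df.describe())
print(df.isnull().sum())
num_cols = df.select_dtypes("number").columns
sns.heatmap(df[num_cols].corr(), annot=True, cmap="coolwarm")
plt.title("Correlation heatmap of numeric features")
plt.show()

# 4. Data validation: duplicate rows and negative measurements
print("Duplicate rows:", df.duplicated().sum())
for col in num_cols:
    n_neg = (df[col] < 0).sum()
    if n_neg:
        print(f"{col}: {n_neg} invalid negative value(s)")

# 5. Data cleansing: impute, deduplicate, repair negatives
df[num_cols] = SimpleImputer(strategy="mean").fit_transform(df[num_cols])
df[["species"]] = SimpleImputer(strategy="most_frequent").fit_transform(df[["species"]])
df = df.drop_duplicates().reset_index(drop=True)
for col in num_cols:
    df.loc[df[col] < 0, col] = df.loc[df[col] >= 0, col].mean()

# 6. Outlier and anomaly detection
z = (df[num_cols] - df[num_cols].mean()) / df[num_cols].std()
print("Z-score outlier rows:", int((z.abs() > 3).any(axis=1).sum()))
iso = IsolationForest(contamination=0.05, random_state=42)
df["Anomaly"] = iso.fit_predict(df[num_cols])  # -1 = anomaly, 1 = normal

# 7. Save the cleaned dataset with anomalous rows removed
df[df["Anomaly"] == 1].drop(columns="Anomaly").to_csv("iris_cleaned.csv", index=False)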
OUTPUT:
Learning Outcomes: The student should have the ability to:
LO 7.1: Identify outliers and invalid entries.
LO 7.2: Apply profiling tools to ensure data integrity.
Course Outcomes: Upon completion of the course, students can perform data profiling, data validation, and data cleansing, and notify outliers and anomalies present in the Iris dataset.
Conclusion:
For Faculty Use:
Correction Parameters | Formative Assessment [40%] | Timely Completion of Practical [40%] | Attendance / Learning Attitude [20%]
Marks Obtained        |                            |                                      |