Experiment 07: Perform data profiling, data validation, and data cleansing. Notify outliers and anomalies present in the given dataset.
• Learning Objective: The student should be able to gain knowledge of data profiling, data validation, and data cleansing, notify anomalies, and visualize the results using matplotlib.
Tools: Python, Kaggle Datasets
Theory:
• Introduction
This experiment focuses on improving the quality and reliability of data through a complete workflow
of data profiling, data validation, data cleansing, and anomaly detection. High-quality data is critical
because any inconsistency or error can mislead analysis and reduce the accuracy of machine learning
models. Using the Iris dataset as an example, the process highlights how to examine the dataset’s
structure, identify problems, correct them, and finally detect unusual patterns or outliers.
• Data Profiling
Data profiling is the first step. It provides a comprehensive understanding of the dataset by
summarizing key characteristics such as column names, data types, missing values, descriptive
statistics, and correlations between numeric features. Techniques like generating summary statistics
and heatmaps reveal the distribution of values and relationships among variables. This stage acts like
a health check, helping identify potential issues such as unexpected data types or inconsistent ranges
before deeper analysis.
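As a brief illustration, a minimal profiling sketch (assuming the Iris measurements are already loaded into a pandas DataFrame named df, as done in the implementation below) could look like this:

import matplotlib.pyplot as plt
import seaborn as sns

# df: pandas DataFrame holding the Iris measurements (assumed loaded earlier)
df.info()                      # column names, data types, non-null counts
print(df.describe())           # descriptive statistics for numeric columns
print(df.isnull().sum())       # missing-value count per column

# Heatmap of pairwise correlations among the numeric features
sns.heatmap(df.select_dtypes("number").corr(), annot=True, cmap="coolwarm")
plt.show()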
• Data Validation
After profiling, data validation enforces logical and business rules to ensure data integrity. The
experiment checks for duplicate rows, invalid negative values in numeric columns, and other
anomalies that violate the expected schema. These validations confirm that the dataset conforms to
defined constraints, such as all measurements being non-negative. Detecting these violations early
prevents errors from propagating into later analysis or modeling stages.
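A small validation sketch under the same assumptions (DataFrame df; the rule that all measurements must be non-negative) might be:

# Duplicate rows violate the one-record-per-observation rule
print("Duplicate rows:", df.duplicated().sum())

# Negative measurements violate the non-negativity constraint
for col in df.select_dtypes("number").columns:
    n_neg = (df[col] < 0).sum()
    if n_neg:
        print(f"{col}: {n_neg} negative value(s) found")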
• Data Cleansing
Data cleansing corrects the problems uncovered during profiling and validation. Missing values are
imputed using mean values for numerical columns and the most frequent value for categorical
columns. Duplicates are removed to avoid double counting, and negative measurements are replaced
with column means to maintain realistic ranges. These steps ensure that the dataset is consistent,
complete, and ready for use in machine learning or statistical modeling.
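These repairs can be sketched as follows (DataFrame df with a categorical species column assumed; SimpleImputer comes from scikit-learn):

from sklearn.impute import SimpleImputer

num_cols = df.select_dtypes("number").columns

# Impute missing values: mean for numeric, most frequent for categorical
df[num_cols] = SimpleImputer(strategy="mean").fit_transform(df[num_cols])
df[["species"]] = SimpleImputer(strategy="most_frequent").fit_transform(df[["species"]])

# Remove duplicates, then replace negative entries with the mean of valid values
df = df.drop_duplicates()
for col in num_cols:
    df.loc[df[col] < 0, col] = df.loc[df[col] >= 0, col].mean()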
• Outlier and Anomaly Detection
Finally, the workflow identifies records that deviate significantly from the norm. Two approaches are
demonstrated: the statistical Z-score method flags points that lie more than three standard deviations
from the mean, while the Isolation Forest algorithm applies machine learning to detect unusual
patterns without relying on distributional assumptions. Highlighting these outliers is important
because they can either indicate rare but valid events or point to data entry errors that need further
investigation.
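Both approaches can be sketched briefly (df assumed as before; the contamination rate matches the implementation notes below, and the random seed is an illustrative choice):

from sklearn.ensemble import IsolationForest

num = df.select_dtypes("number")

# Z-score method: standardize each value and flag |z| > 3
z = (num - num.mean()) / num.std()
print("Z-score outlier rows:", int((z.abs() > 3).any(axis=1).sum()))

# Isolation Forest: fit_predict returns -1 for anomalies, 1 for normal rows
iso = IsolationForest(contamination=0.05, random_state=42)  # seed is illustrative
df["Anomaly"] = iso.fit_predict(num)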
Implementation (a consolidated code sketch covering these steps appears under Code & Output below):
1. Environment Setup
Imported essential Python libraries—pandas and numpy for data handling, matplotlib and
seaborn for visualization, and sklearn modules (SimpleImputer, IsolationForest, datasets)
for preprocessing and anomaly detection.
2. Data Loading
Loaded the Iris dataset using sklearn.datasets.load_iris and converted it into a Pandas
DataFrame. Added a species column for class labels. To demonstrate cleaning techniques,
deliberately introduced missing values (NaN) and invalid negative values in numeric columns.
3. Data Profiling
Executed df.info(), df.describe(), and df.isnull().sum() to get schema details,
descriptive statistics, and missing value counts. Plotted a correlation heatmap using
seaborn.heatmap to visualize relationships between numerical features.
4. Data Validation
Detected duplicates with df.duplicated().sum(). Checked for invalid negative values in each
numeric column by filtering rows where values were less than zero. This ensured all
measurements stayed within biologically meaningful ranges.
5. Data Cleansing
Applied SimpleImputer with the mean strategy for numeric columns and most-frequent strategy
for categorical columns to replace missing values. Dropped duplicate rows and replaced any
negative numeric entries with the column mean to maintain consistency.
6. Outlier & Anomaly Detection
Used two methods:
o Z-Score Method: Calculated z-scores z = (x − mean) / std and flagged values with |z| > 3 as outliers.
o Isolation Forest: Fitted an unsupervised Isolation Forest model (contamination=0.05) to
identify anomalous rows. Marked anomalies in a new Anomaly column.
7. Saving the Cleaned Dataset
Filtered out the anomalous rows (Isolation Forest labels anomalies -1, so only rows with Anomaly == 1 were retained) and exported the cleaned DataFrame to a CSV file named iris_cleaned.csv using DataFrame.to_csv().
Code & Output:
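The following consolidated sketch is assembled from the implementation steps above; the deliberately corrupted cell positions and the random seed are illustrative assumptions rather than values prescribed by the experiment:

# 1. Environment setup
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import datasets
from sklearn.ensemble import IsolationForest
from sklearn.impute import SimpleImputer

# 2. Data loading: Iris -> pandas DataFrame with a species label column
iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df["species"] = iris.target_names[iris.target]

# Deliberately introduce missing and invalid values (cells chosen arbitrarily)
df.loc[5, "sepal length (cm)"] = np.nan
df.loc[12, "petal width (cm)"] = np.nan
df.loc[20, "sepal width (cm)"] = -3.0

# 3. Data profiling
df.info()
print(df.describe())
print(df.isnull().sum())
num_cols = df.select_dtypes("number").columns
sns.heatmap(df[num_cols].corr(), annot=True, cmap="coolwarm")
plt.title("Correlation heatmap of numeric features")
plt.show()

# 4. Data validation: duplicate rows and negative measurements
print("Duplicate rows:", df.duplicated().sum())
for col in num_cols:
    n_neg = (df[col] < 0).sum()
    if n_neg:
        print(f"{col}: {n_neg} invalid negative value(s)")

# 5. Data cleansing: impute, deduplicate, repair negatives
df[num_cols] = SimpleImputer(strategy="mean").fit_transform(df[num_cols])
df[["species"]] = SimpleImputer(strategy="most_frequent").fit_transform(df[["species"]])
df = df.drop_duplicates().reset_index(drop=True)
for col in num_cols:
    df.loc[df[col] < 0, col] = df.loc[df[col] >= 0, col].mean()

# 6. Outlier and anomaly detection
z = (df[num_cols] - df[num_cols].mean()) / df[num_cols].std()
print("Z-score outlier rows:", int((z.abs() > 3).any(axis=1).sum()))
iso = IsolationForest(contamination=0.05, random_state=42)
df["Anomaly"] = iso.fit_predict(df[num_cols])  # -1 = anomaly, 1 = normal

# 7. Save the cleaned dataset with anomalous rows removed
df[df["Anomaly"] == 1].drop(columns="Anomaly").to_csv("iris_cleaned.csv", index=False)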
OUTPUT:
Learning Outcomes: The student should have the ability to:
LO 7.1: Identify outliers and invalid entries.
LO 7.2: Apply profiling tools to ensure data integrity.
Course Outcomes: Upon completion of the course, students can perform data profiling, data validation, and data cleansing, and notify outliers and anomalies present in the Iris dataset.
Conclusion:
For Faculty Use:
Correction Parameters | Formative Assessment [40%] | Timely Completion of Practical [40%] | Attendance / Learning Attitude [20%]
Marks Obtained        |                            |                                      |