Exploratory Data Analysis (EDA) in Data Science
1. Introduction to EDA
Exploratory Data Analysis (EDA) is a fundamental step in data science and machine
learning that involves analyzing datasets to summarize their key characteristics, identify
patterns, and detect anomalies before applying predictive models.
Objectives of EDA:
Understand data structure and patterns.
Identify missing values, outliers, and inconsistencies.
Discover relationships between variables.
Validate assumptions before building models.
Improve data quality through feature engineering.
2. Steps in Exploratory Data Analysis
Load Data: Import dataset using Pandas
Understand Structure: View column types, missing values, and basic stats
Handle Missing Values: Remove or fill NaNs (mean, median, mode)
Remove Duplicates: Identify and drop duplicate rows
Visualize Data: Histograms, boxplots, scatter plots, heatmaps
Outlier Detection: Use IQR or boxplots
Handle Categorical Data: Convert to numeric format (one-hot, label encoding)
Feature Engineering: Create new features and scale data
Save Cleaned Data: Store processed dataset for modeling
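Assuming a small made-up dataset (the column names and values below are hypothetical, standing in for a real CSV), the core of this workflow can be sketched in a few lines:

```python
import pandas as pd

# Hypothetical toy dataset standing in for a real file
df = pd.DataFrame({
    "age": [25, 32, None, 25, 47],
    "city": ["Delhi", "Mumbai", "Delhi", "Delhi", None],
})

df = df.drop_duplicates()                                   # Remove duplicate rows
df["age"] = df["age"].fillna(df["age"].mean())              # Fill numeric NaNs with the mean
df["city"] = df["city"].fillna(df["city"].mode().iloc[0])   # Fill categorical NaNs with the mode
df = pd.get_dummies(df, columns=["city"], drop_first=True)  # One-hot encode the categorical column

print(df.shape)
```

Each of these steps is expanded on in the sections that follow.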
Step 1: Load the Dataset
Import necessary libraries and read the dataset.
import pandas as pd
df = pd.read_csv("data.csv") # Replace with actual file path
print(df.head()) # Display first five rows
Step 2: Understand Data Structure
View column types, null values, and basic information.
print(df.info()) # Column names, data types, non-null values
print(df.describe()) # Summary statistics (mean, median, etc.)
3. Handling Missing Data
Missing data can impact model accuracy. Common techniques to handle missing values:
Remove missing values: df.dropna()
Fill missing values with mean/median/mode:
In recent pandas versions, df.mean() raises an error on non-numeric columns, so fill numeric and categorical columns separately:
num_cols = df.select_dtypes(include="number").columns
df[num_cols] = df[num_cols].fillna(df[num_cols].mean())          # Fill numerical NaNs with mean
cat_cols = df.select_dtypes(exclude="number").columns
df[cat_cols] = df[cat_cols].fillna(df[cat_cols].mode().iloc[0])  # Fill categorical NaNs with mode
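The fill strategies above can be tried on a toy DataFrame (column names and values here are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "price": [10.0, None, 30.0],   # Numeric column with a missing value
    "color": ["red", None, "red"], # Categorical column with a missing value
})

# Fill numeric NaNs with the column mean, categorical NaNs with the mode
df["price"] = df["price"].fillna(df["price"].mean())
df["color"] = df["color"].fillna(df["color"].mode().iloc[0])

print(df)
```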
4. Handling Duplicate Data
Detect and remove duplicate rows to avoid redundancy.
print("Duplicates:", df.duplicated().sum()) # Count duplicate rows
df.drop_duplicates(inplace=True) # Remove duplicates
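A minimal sketch on a toy DataFrame (column names are hypothetical); note that assigning the result of drop_duplicates() is a common alternative to inplace=True:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 1, 2], "b": ["x", "x", "y"]})

n_dupes = df.duplicated().sum()  # The second row repeats the first
df = df.drop_duplicates()        # Keep the first occurrence of each row
```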
5. Data Visualization for EDA
A. Univariate Analysis (Single Variable)
1. Histogram (Data Distribution)
Helps understand the spread of numerical features.
import matplotlib.pyplot as plt
df["column_name"].hist(bins=30)
plt.show()
2. Boxplot (Outlier Detection)
Shows quartiles and outliers.
import seaborn as sns
sns.boxplot(df["column_name"])
plt.show()
B. Bivariate Analysis (Two Variables)
1. Scatter Plot (Correlation between two features)
Used for continuous variables.
sns.scatterplot(x="feature1", y="feature2", data=df)
plt.show()
2. Correlation Heatmap
Shows relationships between numerical variables.
plt.figure(figsize=(10, 6))
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap="coolwarm")  # numeric_only avoids errors on non-numeric columns
plt.show()
3. Pairplot
Visualizes pairwise relationships.
sns.pairplot(df)
plt.show()
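The correlation matrix behind the heatmap can be computed and inspected directly. A small sketch with made-up columns (note that non-numeric columns must be excluded, which numeric_only=True handles in pandas 2.x):

```python
import pandas as pd

df = pd.DataFrame({
    "x": [1, 2, 3, 4],
    "y": [2, 4, 6, 8],       # Perfectly linear in x, so correlation is 1.0
    "label": list("abcd"),   # Non-numeric column, excluded from corr()
})

corr = df.corr(numeric_only=True)  # Pearson correlation over numeric columns only
print(corr)
```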
6. Outlier Detection and Handling
A. Using IQR (Interquartile Range) Method
Remove data points beyond 1.5 times the IQR.
num_df = df.select_dtypes(include="number")  # Quantiles are only defined for numeric columns
Q1 = num_df.quantile(0.25)
Q3 = num_df.quantile(0.75)
IQR = Q3 - Q1
outlier_mask = ((num_df < (Q1 - 1.5 * IQR)) | (num_df > (Q3 + 1.5 * IQR))).any(axis=1)
df_cleaned = df[~outlier_mask]
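The IQR rule can be checked on a single toy column (values chosen so one point is an obvious outlier):

```python
import pandas as pd

df = pd.DataFrame({"value": [10, 12, 11, 13, 12, 100]})  # 100 is far outside the rest

Q1 = df["value"].quantile(0.25)
Q3 = df["value"].quantile(0.75)
IQR = Q3 - Q1
mask = (df["value"] < Q1 - 1.5 * IQR) | (df["value"] > Q3 + 1.5 * IQR)
df_cleaned = df[~mask]  # Keeps only rows within [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
```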
7. Handling Categorical Data
A. Encoding Categorical Variables
1. One-Hot Encoding (Best for nominal categories)
df = pd.get_dummies(df, columns=["categorical_column"], drop_first=True)
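A quick sketch with a hypothetical "size" column; drop_first=True removes one dummy column to avoid redundancy (the dropped category is implied when all remaining dummies are zero):

```python
import pandas as pd

df = pd.DataFrame({"size": ["S", "M", "L", "M"]})

# Categories are sorted (L, M, S); drop_first drops the first dummy ("L")
encoded = pd.get_dummies(df, columns=["size"], drop_first=True)
```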
2. Label Encoding (For ordinal categories like Low, Medium, High)
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
df["encoded_column"] = encoder.fit_transform(df["categorical_column"])
Note that LabelEncoder assigns integer codes alphabetically, so it does not preserve an ordinal order such as Low < Medium < High; for truly ordinal data, use an explicit mapping or OrdinalEncoder with a fixed category order.
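The alphabetical behavior of LabelEncoder is easy to demonstrate on a toy column (names below are hypothetical), alongside an explicit mapping that does preserve the intended order:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"grade": ["Low", "High", "Medium", "Low"]})

encoder = LabelEncoder()
codes = encoder.fit_transform(df["grade"])  # Alphabetical: High=0, Low=1, Medium=2

# To preserve the real Low < Medium < High order, map explicitly instead
order = {"Low": 0, "Medium": 1, "High": 2}
df["grade_ord"] = df["grade"].map(order)
```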
8. Feature Engineering
Creating new meaningful features to improve models.
A. Creating a New Feature
df["new_feature"] = df["feature1"] * df["feature2"]
B. Feature Scaling
1. Min-Max Scaling (Rescale to range 0-1)
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df_scaled = scaler.fit_transform(df.select_dtypes(include="number"))  # Scalers require numeric input
2. Standardization (Mean = 0, Std Dev = 1)
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df_scaled = scaler.fit_transform(df.select_dtypes(include="number"))
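A self-contained min-max example on made-up numeric columns; fit_transform returns a NumPy array, so wrapping it back into a DataFrame keeps the column names:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({"height": [150.0, 160.0, 170.0], "weight": [50.0, 60.0, 90.0]})

scaler = MinMaxScaler()
scaled = scaler.fit_transform(df)  # Each column is rescaled independently to [0, 1]
df_scaled = pd.DataFrame(scaled, columns=df.columns)
```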
9. Saving the Cleaned Dataset
df.to_csv("cleaned_data.csv", index=False)
EDA is a crucial step in data science: it improves data quality and, in turn, model accuracy. By exploring and visualizing the dataset, we can make informed decisions before applying machine learning models.