MACHINE LEARNING LAB BCSL606
Program No 1
Develop a program to create histograms for all numerical features and analyze
the distribution of each feature. Generate box plots for all numerical features
and identify any outliers. Use the California Housing dataset.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import fetch_california_housing
# Load the dataset
california_housing = fetch_california_housing()
data = pd.DataFrame(california_housing.data, columns=california_housing.feature_names)
data['MedHouseVal'] = california_housing.target # Add the target variable to the dataframe
# Plot histograms for all numerical features
data.hist(bins=30, figsize=(15, 10))
plt.suptitle('Histograms of Numerical Features')
plt.show()
# Plot box plots for all numerical features
plt.figure(figsize=(15, 10))
num_features = len(data.columns)
for i, column in enumerate(data.columns, 1):  # Subplot indices start at 1
    plt.subplot((num_features + 2) // 3, 3, i)  # 3 columns, ceil(num_features / 3) rows
    sns.boxplot(y=data[column])
    plt.title(column)
plt.tight_layout()
plt.suptitle('Box Plots of Numerical Features', y=1.02) # Adjust position to avoid overlap
NOOR SUMAIYA, ASST.PROF, DEPT OF CSE, TOCE Page 1
plt.show()
# Identify outliers using the IQR method
for column in data.columns:
    Q1 = data[column].quantile(0.25)
    Q3 = data[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    outliers = data[(data[column] < lower_bound) | (data[column] > upper_bound)]
    print(f"Outliers in {column}: {len(outliers)}")
Output:
Outliers in MedInc: 681
Outliers in HouseAge: 0
Outliers in AveRooms: 511
Outliers in AveBedrms: 1424
Outliers in Population: 1196
Outliers in AveOccup: 711
Outliers in Latitude: 0
Outliers in Longitude: 0
Outliers in MedHouseVal: 1071
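The counts above say how many rows fall outside the IQR fences, but not where the fences lie or how skewed each feature is. The following is a minimal sketch of one way to tabulate both, not part of the original program. The helper name iqr_summary, its parameter k, and the toy DataFrame are assumptions made for illustration. The toy values are made up and are not taken from the California Housing dataset.

```python
import pandas as pd


def iqr_summary(df: pd.DataFrame, k: float = 1.5) -> pd.DataFrame:
    """Hypothetical helper: per-column skewness, IQR fences, and outlier count."""
    q1 = df.quantile(0.25)
    q3 = df.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    # True where a value lies outside the fences of its own column
    mask = df.lt(lower, axis=1) | df.gt(upper, axis=1)
    return pd.DataFrame({
        "skew": df.skew(),          # > 0 means a longer right tail
        "lower": lower,
        "upper": upper,
        "n_outliers": mask.sum(),
    })


# Toy data, made up for illustration only
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6, 7, 8, 9, 100],            # one extreme value
    "b": [10, 11, 12, 13, 14, 15, 16, 17, 18, 19],    # evenly spread
})
summary = iqr_summary(toy)
print(summary)
```

Assuming the same DataFrame as in the program above, calling iqr_summary(data) should reproduce the outlier counts in the n_outliers column, since it applies the same 1.5 * IQR rule. A positive skew value next to a large count on only one side of the fences suggests a long right tail rather than isolated data errors.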