ML Program No.1

The document outlines a program to analyze the California Housing dataset by creating histograms and box plots for all numerical features to assess their distributions and identify outliers. It includes code for loading the dataset, plotting histograms and box plots, and calculating outliers using the IQR method. The output lists the number of outliers found in each numerical feature.

Uploaded by

Sunil as

MACHINE LEARNING LAB BCSL606

Program No 1

Develop a program to create histograms for all numerical features and analyze
the distribution of each feature. Generate box plots for all numerical features
and identify any outliers. Use the California Housing dataset.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import fetch_california_housing

# Load the dataset
california_housing = fetch_california_housing()
data = pd.DataFrame(california_housing.data, columns=california_housing.feature_names)
data['MedHouseVal'] = california_housing.target  # Add the target variable to the dataframe

# Plot histograms for all numerical features
data.hist(bins=30, figsize=(15, 10))
plt.suptitle('Histograms of Numerical Features')
plt.show()
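The histograms give a visual impression of each feature; the distribution analysis asked for in the problem statement could be backed with a few numbers. The following is a minimal sketch, not part of the lab program: the helper name `describe_distributions` and the skewness threshold of 0.5 are illustrative assumptions, and the demo uses a small synthetic frame so that it runs without downloading the dataset.

```python
import numpy as np
import pandas as pd


def describe_distributions(df: pd.DataFrame, threshold: float = 0.5) -> pd.DataFrame:
    """Summarise each numeric column (hypothetical helper, not from the lab sheet)."""
    numeric = df.select_dtypes(include="number")
    summary = pd.DataFrame({
        "mean": numeric.mean(),
        "median": numeric.median(),
        "skew": numeric.skew(),
    })
    # Label the shape using an assumed rule-of-thumb threshold on skewness
    summary["shape"] = np.select(
        [summary["skew"] > threshold, summary["skew"] < -threshold],
        ["right-skewed", "left-skewed"],
        default="roughly symmetric",
    )
    return summary


# Synthetic demo data: one symmetric column and one with a long right tail
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "symmetric": rng.normal(loc=0.0, scale=1.0, size=5000),
    "long_tail": rng.lognormal(mean=0.0, sigma=1.0, size=5000),
})
dist_summary = describe_distributions(demo)
print(dist_summary)
```

In the lab program, the same helper could be called as `describe_distributions(data)`. A mean well above the median together with a large positive skew corresponds to a histogram with a long right tail.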

# Plot box plots for all numerical features
plt.figure(figsize=(15, 10))
num_features = len(data.columns)
n_cols = 3
n_rows = (num_features + n_cols - 1) // n_cols  # Ceiling division avoids an empty extra row
for i, column in enumerate(data.columns, 1):  # Start index from 1 for subplot
    plt.subplot(n_rows, n_cols, i)
    sns.boxplot(y=data[column])
    plt.title(column)
plt.tight_layout()
plt.suptitle('Box Plots of Numerical Features', y=1.02)  # Adjust position to avoid overlap
plt.show()

NOOR SUMAIYA, ASST.PROF, DEPT OF CSE, TOCE Page 1
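With default settings, the whiskers of these box plots extend to the most extreme points that lie within 1.5 × IQR of the quartiles, and anything beyond is drawn as an individual point. The sketch below illustrates this on a small made-up array using `matplotlib.cbook.boxplot_stats`, which returns the statistics Matplotlib uses to draw a box plot; the array values are an assumption chosen for the example, not data from the housing set.

```python
import numpy as np
from matplotlib.cbook import boxplot_stats

# Made-up values with one obvious extreme point
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype=float)

# whis=1.5 places the whisker limits at 1.5 * IQR beyond the quartiles
stats = boxplot_stats(values, whis=1.5)[0]

print("Q1, median, Q3:", stats["q1"], stats["med"], stats["q3"])
print("whiskers:", stats["whislo"], stats["whishi"])
print("fliers (drawn as individual points):", stats["fliers"])
```

The points reported as fliers are the same ones the IQR rule in the next step counts as outliers.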

# Identify outliers using the IQR method
for column in data.columns:
    Q1 = data[column].quantile(0.25)
    Q3 = data[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    # Rows whose value falls outside [lower_bound, upper_bound] count as outliers
    outliers = data[(data[column] < lower_bound) | (data[column] > upper_bound)]
    print(f"Outliers in {column}: {len(outliers)}")
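The same IQR rule can also be applied to all columns at once, which makes the bounds themselves visible alongside the counts. This is a sketch under stated assumptions: `iqr_outlier_summary` is a hypothetical helper name, and the two-column demo frame is made up so the example runs without the dataset.

```python
import pandas as pd


def iqr_outlier_summary(df: pd.DataFrame, k: float = 1.5) -> pd.DataFrame:
    """Per-column IQR bounds and outlier counts (hypothetical helper)."""
    numeric = df.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    # Comparing a DataFrame with a Series along axis="columns" matches
    # each column against its own bound
    mask = numeric.lt(lower, axis="columns") | numeric.gt(upper, axis="columns")
    return pd.DataFrame({
        "lower_bound": lower,
        "upper_bound": upper,
        "n_outliers": mask.sum(),
        "pct_outliers": (100 * mask.mean()).round(2),
    })


# Made-up demo data: column "a" has one extreme value, column "b" has none
demo = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6, 7, 8, 9, 100],
    "b": [10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
})
outlier_summary = iqr_outlier_summary(demo)
print(outlier_summary)
```

Calling `iqr_outlier_summary(data)` on the housing frame should give the same counts as the loop, since both use pandas' default quantile interpolation.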

Output:

Outliers in MedInc: 681
Outliers in HouseAge: 0
Outliers in AveRooms: 511
Outliers in AveBedrms: 1424
Outliers in Population: 1196
Outliers in AveOccup: 711
Outliers in Latitude: 0
Outliers in Longitude: 0
Outliers in MedHouseVal: 1071
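Raw counts are easier to compare as a share of the dataset. The sketch below converts the counts listed above into percentages, assuming the standard scikit-learn copy of the dataset with 20,640 rows; the variable names are illustrative.

```python
# Outlier counts copied from the output above
outlier_counts = {
    "MedInc": 681,
    "HouseAge": 0,
    "AveRooms": 511,
    "AveBedrms": 1424,
    "Population": 1196,
    "AveOccup": 711,
    "Latitude": 0,
    "Longitude": 0,
    "MedHouseVal": 1071,
}

N_ROWS = 20640  # Assumed row count of the scikit-learn California Housing dataset

# Percentage of rows flagged in each feature
outlier_share = {
    name: round(100 * count / N_ROWS, 2) for name, count in outlier_counts.items()
}

# Print features from most to least affected
for name, pct in sorted(outlier_share.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<12}{pct:>6.2f}%")
```

The per-feature counts should not be summed to estimate the number of unusual rows, because the same row can be flagged in more than one feature.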
