
ML 8 Program

The document outlines an experiment focused on applying four data preprocessing techniques: attribute selection, handling missing values, discretization, and elimination of outliers using Python libraries. Each technique is demonstrated with sample code, including the use of SelectKBest for feature selection, SimpleImputer for filling missing values, KBinsDiscretizer for discretization, and IQR for outlier removal. The document provides detailed explanations and code snippets for each preprocessing step.

Experiment – 8

AIM: Apply the following pre-processing techniques to a given dataset


A. Attribute selection
B. Handling missing values
C. Discretization
D. Elimination of outliers
Description of the Preprocessing Program:
The program demonstrates four key data preprocessing techniques: attribute selection, handling missing values, discretization, and elimination of outliers, using popular Python libraries such as pandas, numpy, and scikit-learn. Here is a detailed breakdown of each step:

A. Attribute Selection

Attribute selection is the process of selecting relevant features from a dataset to reduce dimensionality and improve model performance. Here, we use SelectKBest from the sklearn.feature_selection module to select the top k features.

Code:

from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.datasets import load_iris

# Load dataset (Iris dataset for example)
data = load_iris()
X = data.data
y = data.target

# Apply SelectKBest
selector = SelectKBest(f_classif, k=2)  # Select top 2 features
X_new = selector.fit_transform(X, y)

# Print selected features
print("Selected Features:")
print(X_new)

Output:

Selected Features:
[[1.4 0.2]
 [1.4 0.2]
 [1.3 0.2]
 ...
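The program prints only the transformed values, not which of the original columns survived. As a small sketch under the same Iris setup, the fitted selector's get_support() mask can be matched against the dataset's feature names:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

data = load_iris()
selector = SelectKBest(f_classif, k=2).fit(data.data, data.target)

# get_support() returns a boolean mask over the original columns
mask = selector.get_support()
selected = [name for name, keep in zip(data.feature_names, mask) if keep]
print(selected)  # the two petal measurements score highest under f_classif
```

This confirms the columns in X_new are the petal measurements, since their ANOVA F-scores dominate the sepal features.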

B. Handling Missing Values

Handling missing values involves replacing or removing missing data points. We can use
SimpleImputer from sklearn.impute to fill missing values.

Code:

import pandas as pd
from sklearn.impute import SimpleImputer

# Sample dataset with missing values
data = {'age': [25, 30, None, 35, 40],
        'salary': [50000, 60000, 55000, None, 65000]}
df = pd.DataFrame(data)

# Handle missing values by filling with the column mean
imputer = SimpleImputer(strategy='mean')
df_imputed = imputer.fit_transform(df)

# Output the imputed dataframe
df_imputed = pd.DataFrame(df_imputed, columns=df.columns)
print(df_imputed)
Output:

    age   salary
0  25.0  50000.0
1  30.0  60000.0
2  32.5  55000.0
3  35.0  57500.0
4  40.0  65000.0
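The mean is only one imputation strategy. As a sketch on the same toy data, SimpleImputer also accepts strategy='median' (more robust when a column contains outliers), and incomplete rows can instead be discarded with pandas:

```python
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({'age': [25, 30, None, 35, 40],
                   'salary': [50000, 60000, 55000, None, 65000]})

# Median imputation: less sensitive to extreme values than the mean
imputer = SimpleImputer(strategy='median')
df_median = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(df_median)

# Alternative: discard any row containing a missing value
print(df.dropna())
```

On this particular data the median and mean coincide, but on skewed columns they can differ sharply, which is why the strategy choice matters.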

C. Discretization

Discretization involves converting continuous features into discrete bins. We can use KBinsDiscretizer
from sklearn.preprocessing for this.

Code:

from sklearn.preprocessing import KBinsDiscretizer
import numpy as np

# Sample dataset (age)
data = np.array([[18], [25], [30], [40], [60]])

# Apply discretization (into 3 equal-width bins)
scaler = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='uniform')
data_binned = scaler.fit_transform(data)

# Output the binned data
print("Binned Data:")
print(data_binned)

Output:

Binned Data:
[[0.]
 [0.]
 [0.]
 [1.]
 [2.]]
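The bin boundaries computed by KBinsDiscretizer can be inspected through its bin_edges_ attribute. The sketch below repeats the uniform-width example and prints the cut points, which explains the bin labels above:

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

data = np.array([[18], [25], [30], [40], [60]])
disc = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='uniform')
binned = disc.fit_transform(data)

# With strategy='uniform' the range [18, 60] is split into equal-width bins
print(disc.bin_edges_[0])  # [18. 32. 46. 60.]
```

So ages below 32 land in bin 0, ages in [32, 46) in bin 1, and the rest in bin 2; strategy='quantile' would instead place roughly equal counts in each bin.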
D. Elimination of Outliers

Outliers can be eliminated by removing data points that fall outside a defined range, for example beyond 1.5 times the interquartile range (IQR). Here is an example using the IQR rule for outlier removal.

Code:

import numpy as np
import pandas as pd

# Sample data with outliers
data = {'age': [25, 30, 35, 1000, 40, 50, 60, 10000]}
df = pd.DataFrame(data)

# Calculate Q1 (25th percentile) and Q3 (75th percentile)
Q1 = df['age'].quantile(0.25)
Q3 = df['age'].quantile(0.75)
IQR = Q3 - Q1

# Define outlier range
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Eliminate outliers
df_no_outliers = df[(df['age'] >= lower_bound) & (df['age'] <= upper_bound)]

# Output the dataset without outliers
print("Dataset without outliers:")
print(df_no_outliers)
Output:

Dataset without outliers:

   age
0   25
1   30
2   35
4   40
5   50
6   60
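The IQR rule is one of several outlier filters. A common alternative (not part of the program above) is the z-score; as a sketch on the same data, note that extreme outliers inflate the mean and standard deviation themselves, so a moderate outlier like 1000 can survive a z-score filter that the IQR rule removes:

```python
import pandas as pd

df = pd.DataFrame({'age': [25, 30, 35, 1000, 40, 50, 60, 10000]})

# Keep rows whose z-score magnitude is below 2
z = (df['age'] - df['age'].mean()) / df['age'].std()
df_filtered = df[z.abs() < 2]
print(df_filtered)  # drops 10000, but 1000 slips through
```

This masking effect is why the IQR rule, which uses quartiles rather than the mean, is often preferred for heavily skewed data.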
