Experiment – 8
AIM: Apply the following pre-processing techniques to a given dataset:
A. Attribute selection
B. Handling missing values
C. Discretization
D. Elimination of outliers
Description of the Preprocessing Program:
The programs below demonstrate four key data preprocessing techniques: attribute selection, handling missing values, discretization, and elimination of outliers, using the popular Python libraries pandas, NumPy, and scikit-learn. A detailed breakdown of each step follows.
A. Attribute Selection
Attribute selection is the process of selecting the most relevant features from a dataset to reduce dimensionality and improve model performance. Here we use SelectKBest from the sklearn.feature_selection module to select the top k features.
Code:
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.datasets import load_iris
import pandas as pd
# Load dataset (Iris dataset for example)
data = load_iris()
X = data.data
y = data.target
# Apply SelectKBest
selector = SelectKBest(f_classif, k=2) # Select top 2 features
X_new = selector.fit_transform(X, y)
# Print selected features
print("Selected Features:")
print(X_new)
Output:
Selected Features:
[[1.4 0.2]
[1.4 0.2]
[1.3 0.2]
...
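As a quick check, SelectKBest can also report which attributes were kept: get_support() returns a boolean mask that can be mapped back to the feature names. A small sketch on the same Iris data:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

data = load_iris()
selector = SelectKBest(f_classif, k=2)
selector.fit(data.data, data.target)

# Boolean mask of kept columns, mapped back to attribute names
mask = selector.get_support()
selected = [name for name, keep in zip(data.feature_names, mask) if keep]
print("Selected attribute names:", selected)
```

On the Iris data, the F-test selects the two petal measurements (petal length and petal width), which separate the three species far better than the sepal measurements.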
B. Handling Missing Values
Handling missing values involves replacing or removing missing data points. We can use
SimpleImputer from sklearn.impute to fill missing values.
Code:
import pandas as pd
from sklearn.impute import SimpleImputer
# Sample dataset with missing values
data = {'age': [25, 30, None, 35, 40],
        'salary': [50000, 60000, 55000, None, 65000]}
df = pd.DataFrame(data)
# Handle missing values by filling with mean
imputer = SimpleImputer(strategy='mean')
df_imputed = imputer.fit_transform(df)
# Output the imputed dataframe
df_imputed = pd.DataFrame(df_imputed, columns=df.columns)
print(df_imputed)
Output:
age salary
0 25.0 50000.0
1 30.0 60000.0
2 32.5 55000.0
3 35.0 57500.0
4 40.0 65000.0
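Imputation with the mean is not the only option. pandas itself can fill missing values per column (here with the median) or simply drop incomplete rows; a minimal sketch on the same toy data:

```python
import pandas as pd

data = {'age': [25, 30, None, 35, 40],
        'salary': [50000, 60000, 55000, None, 65000]}
df = pd.DataFrame(data)

# Option 1: fill each column's missing values with that column's median
df_median = df.fillna(df.median())

# Option 2: drop any row containing a missing value
df_dropped = df.dropna()

print(df_median)
print(df_dropped)
```

Dropping rows is only advisable when few values are missing; here it discards 2 of the 5 rows, whereas imputation preserves all of them.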
C. Discretization
Discretization involves converting continuous features into discrete bins. We can use KBinsDiscretizer
from sklearn.preprocessing for this.
Code:
from sklearn.preprocessing import KBinsDiscretizer
import numpy as np
# Sample dataset (age)
data = np.array([[18], [25], [30], [40], [60]])
# Apply discretization (into 3 bins)
scaler = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='uniform')
data_binned = scaler.fit_transform(data)
# Output the binned data
print("Binned Data:")
print(data_binned)
Output:
Binned Data:
[[0.]
[0.]
[0.]
[1.]
[2.]]
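The same equal-width binning can be done directly in pandas with pd.cut, which additionally lets you attach human-readable labels to the bins. A small sketch on the same ages (the label names are illustrative, not part of any API):

```python
import pandas as pd

ages = pd.Series([18, 25, 30, 40, 60])

# Three equal-width bins over [18, 60], matching strategy='uniform' above
binned = pd.cut(ages, bins=3, labels=['young', 'middle', 'senior'])
print(binned.tolist())
```

With edges at 18, 32, 46, and 60, the ages 18, 25, and 30 fall in the first bin, 40 in the second, and 60 in the third, mirroring the ordinal codes 0, 0, 0, 1, 2 from KBinsDiscretizer.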
D. Elimination of Outliers
Outliers can be eliminated by removing data points that fall outside a specific range, e.g. beyond 1.5 times the interquartile range (IQR) below the first quartile or above the third quartile. Here is an example using the IQR rule.
Code:
import numpy as np
import pandas as pd
# Sample data with outliers
data = {'age': [25, 30, 35, 1000, 40, 50, 60, 10000]}
df = pd.DataFrame(data)
# Calculate Q1 (25th percentile) and Q3 (75th percentile)
Q1 = df['age'].quantile(0.25)
Q3 = df['age'].quantile(0.75)
IQR = Q3 - Q1
# Define outlier range
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
# Eliminate outliers
df_no_outliers = df[(df['age'] >= lower_bound) & (df['age'] <= upper_bound)]
# Output the dataset without outliers
print("Dataset without outliers:")
print(df_no_outliers)
Output:
Dataset without outliers:
age
0 25
1 30
2 35
4 40
5 50
6 60
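A common alternative to the IQR rule is the z-score criterion: drop points more than 3 standard deviations from the mean. A sketch on the same data; note that on this dataset the rule removes nothing, because the extreme values 1000 and 10000 inflate the mean and standard deviation themselves (a weakness known as masking), which is one reason the IQR rule above is often preferred:

```python
import pandas as pd

# Same sample data with outliers
df = pd.DataFrame({'age': [25, 30, 35, 1000, 40, 50, 60, 10000]})

# z-score of each point relative to the column's mean and standard deviation
z = (df['age'] - df['age'].mean()) / df['age'].std()

# Keep only points within 3 standard deviations of the mean
df_filtered = df[z.abs() <= 3]
print(df_filtered)  # all 8 rows survive on this data
```

Because the IQR is computed from the middle 50% of the data, it is far less sensitive to the outliers it is trying to detect, which is why the IQR-based program above successfully removes 1000 and 10000 while the 3-sigma rule does not.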