Seven Lab Instruction

The document discusses various data normalization techniques in machine learning, including LP norms, Min-Max normalization, and Z-score normalization, emphasizing their importance in preprocessing data for classification algorithms. It also covers data cleaning methods using Pandas, such as handling missing values and one-hot encoding for categorical variables. Additionally, it explains how to read and save CSV files, as well as the differences between concatenating and joining DataFrames in Pandas.



from sklearn.preprocessing import StandardScaler, MinMaxScaler, Normalizer, normalize, OrdinalEncoder, OneHotEncoder
from sklearn.model_selection import train_test_split
Data Normalization
Reading csv in df
Simple data cleaning
L-p norm (normalize & Normalizer) for numeric data
StandardScaler
MinMax
OHE using get_dummies
OHE using OneHotEncoder class
Concat v/s join
Saving transformed df in csv for future use
Separating feature vector from Target label using df.iloc[row,col]
Train/Test split

Data Normalization
• Normalization is particularly useful for classification algorithms involving
neural networks or distance measurements such as nearest-neighbor
classification and clustering.
• If using the neural network backpropagation algorithm for classification,
normalizing the input values for each attribute measured in the training
tuples will help speed up the learning phase.
• For distance-based methods, normalization helps prevent attributes with
large ranges (e.g., income) from outweighing attributes with smaller
ranges (e.g., binary attributes).
• It is also useful when given no prior knowledge of the data.


LP-Norm
• For normalizing data using LP norms, first we need to calculate their
norms.
• The norm of a vector captures some property of the vector; e.g., the L2 norm of a vector denotes its length/magnitude.
• One important use of the L2 norm is to transform a given vector into a unit-length vector,
• i.e., making the magnitude of the vector = 1 while still preserving its direction.
• This is achieved by dividing each element of the vector by its length, i.e., its L2 norm.
• Another use of the L1 & L2 norms is in the computation of loss in regularized gradient descent algorithms.
• They are also used in the well-known Ridge and Lasso regression algorithms.

LP-Norm
• The p-norm is a norm on suitable real vector spaces given by the pth root of the sum (or integral) of the pth powers of the absolute values of the vector components.
• The general definition for the p-norm of a vector v that has N elements is:

$$\|v\|_p = \left( \sum_{i=1}^{N} |v_i|^p \right)^{1/p}$$

• https://www.journaldev.com/45324/norm-of-vector-python

where p is any positive real value, Inf, or -Inf. Some common values of p are 1, 2, and Inf.
• If p is 1, then the resulting 1-norm is the sum of the absolute values of the vector elements.
• If p is 2, then the resulting 2-norm gives the vector magnitude or Euclidean length of the vector.
• If p is Inf, then the resulting Inf-norm is the maximum of the absolute values of the vector elements.


Normalisation using LP norm

• Consider a vector x = (1, 2, 3, 4). Find the 1-norm and 2-norm of this vector, and normalize it using both LP norms.
• For the given vector: 1-norm = 10, 2-norm = sqrt(30) ≈ 5.477.
• The 1-norm-normalized vector is (1/10, 2/10, 3/10, 4/10).
• The entries of the L1-normalized vector sum to 1, so the 1-norm-normalized vector can be interpreted as a probability distribution.
• For normalization using the 2-norm, we get the vector (1/5.477, 2/5.477, 3/5.477, 4/5.477) = (0.1826, 0.3651, 0.5477, 0.7303).
• The 2-norm-normalized vector can be interpreted as the tip of a direction vector on the unit hypersphere, starting at the origin.
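A minimal sketch verifying this example with scikit-learn's normalize (note the 2-D input shape; one row = one vector):

import numpy as np
from sklearn.preprocessing import normalize

x = np.array([[1.0, 2.0, 3.0, 4.0]])   # 2-D shape: one row = one sample
print(normalize(x, norm='l1'))  # [[0.1 0.2 0.3 0.4]] -> entries sum to 1
print(normalize(x, norm='l2'))  # [[0.1826 0.3651 0.5477 0.7303]]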

Rescaling
• Rescaling changes the distance between the min and max values in a data
set by stretching or squeezing the points along the number line.

• The equation for rescaling data X to an arbitrary interval [a, b] is:

$$X' = a + \frac{(X - \min X)\,(b - a)}{\max X - \min X}$$


Example: Min-max normalization


• Min-max normalization: to [new_min_A, new_max_A]

$$v' = \frac{v - min_A}{max_A - min_A}\,(new\_max_A - new\_min_A) + new\_min_A$$

• E.g., let income range from $12,000 to $98,000, normalized to [0.0, 1.0]. Then $73,600 is mapped to

$$\frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}\,(1.0 - 0) + 0 = 0.716$$
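A quick check of this arithmetic in plain Python:

v, min_a, max_a = 73_600, 12_000, 98_000
new_min, new_max = 0.0, 1.0
print(round((v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min, 3))   # 0.716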
• Normalization by decimal scaling

$$v' = \frac{v}{10^{\,j}}$$  where j is the smallest integer such that Max(|v'|) < 1

Standard Normal Distribution


(or mean-centered distribution)
• A standard normal distribution has mean = 0 and standard deviation=1.

• Z-score normalization is useful in machine learning settings since it can tell you how far a data point is from the average of the whole data set.

• It can be most appropriate when there are just a few outliers, since it
provides a simple way to compare a data point to the norm.

• We calculate a z-score when comparing data sets that are likely to be similar because of some genetic or experimental reason, e.g.,
• a physical attribute of an animal, or
• results within a certain time frame.


Z-score
• z-scores measure the distance of a data point from the mean in
terms of the standard deviation.
• The standardized data set has mean=0 and standard deviation 1,
and retains the shape properties of the original data set.

Z-score
• The z-scores of the data are preserved, so the shape of the distribution remains the same.
• E.g. (μ: mean, σ: standard deviation of attribute A):

$$v' = \frac{v - \mu_A}{\sigma_A}$$

Ex. Let μ = 54,000, σ = 16,000. Then $73,600 is mapped to

$$\frac{73{,}600 - 54{,}000}{16{,}000} = 1.225$$
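The same kind of quick check for the z-score example:

v, mu, sigma = 73_600, 54_000, 16_000
print((v - mu) / sigma)   # 1.225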


normalize() method
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.normalize.html

Reading data from csv

import pandas as pd
df = pd.read_csv('cars.csv')
• use header=None if your csv doesn't have a header; header=0 if the first row has the column headers
• to assign column names while reading a csv with no column header, use names=['','',....]
df1=pd.read_csv("iris.data",header=None,names=['sepal_len','sepal_width','petal_len','petal_width','class'])
• use this for a csv with no header
df=pd.read_csv("test.csv", header=0) #default setting: first row is the col header
header=0 → specifies that the first row (index 0) should be used as the column headers. You can change it.
• saving an updated DataFrame back to csv
df_transf.to_csv("cars_transf.csv",index=False)
• Pandas DataFrames have an implicit index, which is often just a sequence of numbers (0, 1, 2, ...). If index=False is not set, Pandas will write this index as the first column in the CSV, which may not be needed.


Simple Data Cleaning


#simple data cleaning step; set axis=0 for dropping rows with NaN
df.dropna(axis=0,inplace=True)
nan_count_per_column2 = df.isna().sum() #NaN count per column
• By default, dropna() only removes rows containing NaN (None, np.nan) values.
• However, strings like '?', 'na', 'nil', ' ' are not automatically treated as NaN.
• To drop rows with dropna() for such specific strings (' ', '?', 'na', 'nil', etc.), first replace them with NaN:

import numpy as np

# Replace specific string values with NaN
df.replace(['?', 'na', 'nil', ' '], np.nan, inplace=True)
# Drop rows containing NaN in any column; default axis=0
df_cleaned = df.dropna()
# For column-wise drop of columns with NaN values
df.dropna(axis=1,inplace=True)


Demo of dropna()
- It returns only those rows with no occurrences of np.nan.

Demo of dropna(), column-wise.

Demo of fillna()
- It returns a dataframe with all np.nan replaced by a new value, here 'dummy'.
- Values can also be filled feature-wise by each column's mean value.
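A minimal sketch of this behaviour (the values and column names are illustrative):

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1.0, np.nan, 3.0], 'B': [np.nan, 5.0, 6.0]})
print(df.fillna('dummy'))    # every np.nan replaced by the string 'dummy'
print(df.fillna(df.mean()))  # each np.nan replaced feature-wise by its column mean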

Demo of dropna() and fillna()
- By default the inplace flag is set to 'False'.
- Hence all these fillna() or dropna() calls won't have any effect until you set the inplace flag to 'True'.
- After executing a number of fillna() and dropna() calls, if you display your data frame, you get that same old dataframe.
- You need to re-issue these statements again, but with the inplace flag set.


Demo with inplace=True

Demo of conditional selection and column-wise selection
- It returns a subset of df with only those rows for which feature ‘Z’ has values > 0, i.e., rows where the specified condition is true.
- It skips just one row, ‘D’, with a negative value of ‘Z’.
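A minimal sketch of both selections (a random df with rows 'A'..'E', as in the demo):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(5, 3), index=list('ABCDE'), columns=['X', 'Y', 'Z'])
print(df[df['Z'] > 0])   # keeps only rows where feature 'Z' is positive
print(df[['X', 'Y']])    # column-wise selection of a subset of features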


Normalization

Normalization of Numeric type features only

• Pre-requisite
• # Selecting only numeric columns
df_numeric = df.select_dtypes(include=['number'])

• For inline column-wise normalization using the normalize method:

from sklearn.preprocessing import normalize
numeric_norm1 = normalize(df_numeric, norm='max', axis=0)
#axis=0 normalizes data column-wise with the l1, l2, or max norm

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.normalize.html


Column wise normalization using normalize() method

Normalizer class
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Normalizer.html


Creating numpy array

Alternative methods for Feature/Column wise scaling
• For column/feature-wise scaling, StandardScaler() or MinMaxScaler() might be better options, as they do it by default.


Min-Max, Z-Score scaler (default: column wise)

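Both scalers work column-wise by default; a minimal sketch on a toy numeric df (the values are illustrative):

import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

df_num = pd.DataFrame({'income': [12_000, 54_000, 98_000], 'age': [20, 35, 50]})
print(MinMaxScaler().fit_transform(df_num))    # each column rescaled to [0, 1]
print(StandardScaler().fit_transform(df_num))  # each column to mean 0, std 1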


Using Scaler object on Test dataset using Transform

• Once these scaler objects are created and .fit() has been called on some data, the scaler learns simple statistical measures such as min, max, mean, median, standard deviation, etc. from the data.
• We can then invoke the .transform() method on unseen data to get it normalized using the trained scaler object.
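A minimal sketch (income values reused from the min-max example):

import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_tr = np.array([[12_000.], [54_000.], [98_000.]])
X_te = np.array([[73_600.]])

scaler = MinMaxScaler().fit(X_tr)   # learns min and max from the training data only
print(scaler.transform(X_te))       # [[0.716...]] -- unseen data scaled with the trained stats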

Normalizer (default: row-wise)


Max norm, row-wise

Using Normalizer class for column wise normalization


• Normalizer(norm='l2') works row-wise, meaning each row is treated as an independent vector and normalized using the L1, L2, or Max norm.
• To apply it column-wise, you must transpose the data, apply normalization, then transpose it back.
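A minimal sketch of the transpose trick (the values are illustrative):

import numpy as np
from sklearn.preprocessing import Normalizer

X = np.array([[1., 2.], [3., 4.], [5., 6.]])
col_norm = Normalizer(norm='l2').fit_transform(X.T).T  # transpose, normalize rows, transpose back
print(col_norm)   # each *column* now has unit L2 norm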



Column wise Max normalization


Comparison: LP norm, Normalizer class v/s normalize method


OHE using Pandas

• Separate out numeric columns and normalize them using any Lp norm/Z-score/MinMax etc.
• Separate out categorical columns in another DataFrame.
• Separate out ordinal columns in another DataFrame.

Handling Numeric Columns

• Notice the size and dtype of the output variable numeric_norm1.
• Its type is a NumPy array, so you need to typecast it into a DataFrame if you want to concatenate it with another df holding the outcome of OHE.
• We can do that conveniently with this statement:

df_numeric_norm1=pd.DataFrame(numeric_norm1,columns=['km_driven','selling_price'])


One Hot Encoding Using PANDAS

Using get_dummies() method directly on full DataFrame

#we can control data type of get_dummy resultant DF using dtype option
df_fu_ow=pd.get_dummies(df,columns=['fuel','owner'],dtype=np.int32,drop_first=True)
Now we will need to OHE ‘brand’ and drop a few redundant columns before concatenating it with normalized df_numeric


controlling data type of get_dummy resultant DF using dtype option

Method 2: Separating out Categorical columns

df_categorical = df.select_dtypes(include=['object', 'category'])

object → includes string-based categorical columns.
category → includes columns explicitly set as the Pandas categorical data type, which can be set while reading a csv with pd.read_csv (e.g., via its dtype parameter).


Check how many features your OHE df will have

df_fu_ow_brand=pd.get_dummies(df_categorical,dtype=np.int32,drop_first=True)
#we can control the data type of the resultant matrix using the dtype option
#drop_first=True drops the first column/nominal value to optimize the encoding of k nominal values by using just k-1 values

• You must have an idea of how many unique values a categorical feature holds, in order to know how many OHE columns it will generate (see the sketch below).
• If you use 'K-1' OHE outcomes with the option 'drop_first=True' for a column with 'K' unique values, it will add 'K-1' columns to the resulting DF for each such categorical column in your df.
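A minimal sketch for that count, assuming the df_categorical built in the previous step (the per-column figures in the comments follow from these slides' cars.csv numbers):

k = df_categorical.nunique()   # unique values per categorical column
print(k)                       # per these slides: fuel 4, owner 5, brand 32
print((k - 1).sum())           # OHE columns added with drop_first=True: 3 + 4 + 31 = 38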

Using get_dummies on categorical columns


Pandas
● Often the data you need exists in two separate sources; fortunately, Pandas makes it easy to combine these together.
● The simplest combination is when both sources are already in the same format; then a concatenation through the pd.concat() call is all that is needed.

Concatenating multiple df


Concatenating multiple DataFrames

• You can concatenate two Pandas DataFrames by columns using the pd.concat() function with axis=1.
df_combined = pd.concat([df1, df2], axis=1)
• Both df1 and df2 must have the same rows but may have different columns, to be put back to back.
• Handling Different Row Counts:
• If df1 and df2 have different numbers of rows, Pandas fills missing values with NaN. To reset the index, use df_combined.reset_index(drop=True).

More on pd.concat()
# Concatenating DataFrames row-wise (axis=0)
df_concat = pd.concat([df1, df2], ignore_index=True)
• ignore_index=True resets the index after stacking.
# Horizontal Concatenation (Merging Columns)
df_concat_cols = pd.concat([df1, df2], axis=1)
• Columns may get duplicated if they don't have unique labels.


df.join() v/s pd.concat()

df.join()
• Used for database-like joins.
• Best for: merging DataFrames based on index; works like an SQL join.
• Merges only on the index (default behavior).
• Good for column-wise concatenation of columns with the same index.
• Can specify different join types (left, right, inner, outer).
• Does not work well for stacking rows.
• E.g.
df_joined = df1.join(df2)
# Default is left join

pd.concat()
• Used for concatenation.
• Best for: stacking DataFrames (vertically or horizontally).
• Can combine multiple DataFrames along rows (axis=0) or columns (axis=1).
• Works on both row index and columns.
• Does not require a common index.
• Supports ignoring the index and reassigning a new one.

Demo of df.join()
The same output you get with this:
df_joined = pd.concat([df1, df2], axis=1)
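A minimal sketch of this equivalence (toy data):

import pandas as pd

df1 = pd.DataFrame({'A': [1, 2]}, index=[0, 1])
df2 = pd.DataFrame({'B': [3, 4]}, index=[0, 1])

print(df1.join(df2))                  # left join on the index
print(pd.concat([df1, df2], axis=1))  # same output when the indexes match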


Changing the dtype of the numeric_norm NumPy array to be concatenated with df_categorical_ohe

Concatenating categorical OHE df with normalized numeric df

Here the size of the resultant df is 8128 x 40; the first 38 columns are from the OHE df on 3 categorical fields, plus the 2 normalized numeric columns.


Saving the pre-processed and normalized df in csv, for opening it next time for a data analysis

Another method for OHE, for saving on dimensionality, i.e. reducing the count of features


Demo on ‘brand’ feature of Cars.csv

• If a particular feature (e.g. ‘brand’, with 32 unique values in cars.csv) has more than 10 unique values,
• it can be handled separately to save on dimensions in the resultant one-hot encoded df.


Demo on ‘brand’ feature of Cars.csv
Count of columns in the resultant df = 32 - 23 = 9.
You saved on the dimensionality of the data without compromising much on data quality.
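A minimal sketch of this idea on the ‘brand’ column (the cut-off of 8 kept brands and the 'Other' label are illustrative assumptions, not the slides' exact recipe):

import numpy as np
import pandas as pd

df = pd.read_csv('cars.csv')
top_brands = df['brand'].value_counts().nlargest(8).index                      # keep the 8 most frequent brands
df['brand'] = df['brand'].where(df['brand'].isin(top_brands), other='Other')   # lump the rare ones together
df_brand_ohe = pd.get_dummies(df['brand'], dtype=np.int32)                     # 8 kept brands + 'Other' -> 9 columns
print(df_brand_ohe.shape)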

Controlling the data type of the get_dummies resultant DF using the dtype option


OHE demo using OneHotEncoder class

• One-Hot Encoding (Nominal Categories)
• Best for: unordered categories (e.g., Colors, Cities, Product Types)
• One-hot encoding creates binary (0/1) columns for each category.
• drop='first' avoids the dummy variable trap.
• The sparse=False parameter was deprecated in sklearn.preprocessing.OneHotEncoder starting from Scikit-Learn v1.2; use sparse_output=False instead.

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Chicago', 'Houston']})

encoder = OneHotEncoder(sparse_output=False, drop='first')  # drop='first' to avoid dummy variable trap
encoded_data = encoder.fit_transform(df[['City']])

# Convert to DataFrame by typecasting
df_encoded = pd.DataFrame(encoded_data, columns=encoder.get_feature_names_out(['City']))

df = df.join(df_encoded)
print(df)


OHE demo


Ordinal Encoding Using Pandas

Consider this df with only ordinal columns


Using df['Ordinal_col_name'].cat.codes
• Identify ordinal columns manually or use a predefined dictionary.
• Convert them to categorical dtype with ordered categories.
• Apply ordinal encoding using .cat.codes.

• Education Encoding:
• High School → 0, Bachelor → 1, Master → 2, PhD → 3
• Satisfaction Encoding:
• Very Dissatisfied → 0, Dissatisfied → 1, Neutral → 2, Satisfied → 3, Very Satisfied → 4

• This approach ensures that the ordinal nature of the data is preserved.

Using df.cat.codes

import pandas as pd

# Sample DataFrame
df = pd.DataFrame({
    'Education': ['High School', 'Bachelor', 'Master', 'PhD', 'Bachelor', 'Master'],
    'Satisfaction': ['Neutral', 'Satisfied', 'Very Satisfied', 'Dissatisfied', 'Very Dissatisfied', 'Neutral']
})

# Define the order for ordinal encoding
education_order = pd.CategoricalDtype(categories=['High School', 'Bachelor', 'Master', 'PhD'], ordered=True)
satisfaction_order = pd.CategoricalDtype(categories=['Very Dissatisfied', 'Dissatisfied', 'Neutral', 'Satisfied', 'Very Satisfied'], ordered=True)

# Convert columns to categorical type


df['Education'] = df['Education'].astype(education_order)
df['Satisfaction'] = df['Satisfaction'].astype(satisfaction_order)

# Convert to numeric encoding


df['Education_Encoded'] = df['Education'].cat.codes
df['Satisfaction_Encoded'] = df['Satisfaction'].cat.codes

print(df)


Use OE method by assigning order

from sklearn.preprocessing import OrdinalEncoder

Alternate OE method using the OrdinalEncoder class


OrdinalEncoder class
• Best for: features with an inherent order (e.g., Education Level, Satisfaction).
• This assigns integer values such as 0, 1, 2, 3, ... based on a predefined order.

import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({'Education': ['High School', 'Bachelor', 'Master', 'PhD', 'Bachelor', 'Master']})

# Define ordinal categories
education_order = [['High School', 'Bachelor', 'Master', 'PhD']]

# Apply ordinal encoding
encoder = OrdinalEncoder(categories=education_order)
df['Education_Encoded'] = encoder.fit_transform(df[['Education']])
print(df)

OrdinalEncoder demo



Separating Feature vectors from label or target column

Y=df_transf.iloc[:,-1] # assuming the last column is the output label selling_price to be predicted for cars.csv

X=df_transf.iloc[:,0:-1] # same as 0:18 (index 18 excluded if your data has 18 feature columns)


Train/Test split for classification model

from sklearn.model_selection import train_test_split

X_train,X_test,y_train,y_test = train_test_split(X,Y,test_size=0.2,random_state=82, stratify=Y)

#test_size=0.2 separates 20% of data as the Test dataset

When to Use stratify?

• Classification problems where the target variable has imbalanced classes.
• When the dataset has rare classes that might be missing from train/test without stratification.
• To ensure better generalization for machine learning models.
• It ensures that all target label classes appear in the same proportion in the Test dataset as in the original dataset.

Option random_state
• The random_state parameter in Scikit-Learn ensures reproducibility
by controlling the randomness in various functions like
train_test_split(), KMeans(), RandomForestClassifier(), etc.
Why Use random_state?
• Reproducibility: Ensures that the results remain the same every time
you run the code.
• Consistency: Useful when comparing models to ensure they are
trained on the same Train/Test data split.
• Debugging: Helps track and debug issues by keeping the same
random selections.


Train/Test split

from sklearn.model_selection import train_test_split

X_train,X_test,y_train,y_test = train_test_split(X,Y,test_size=0.2,random_state=82)

#stratify is not needed (and not applicable) for regression problems,
#i.e. for a dataset with a continuous-valued Target label

