Seven Lab Instruction

The document discusses various data normalization techniques in machine learning, including LP norms, Min-Max normalization, and Z-score normalization, emphasizing their importance in preprocessing data for classification algorithms. It also covers data cleaning methods using Pandas, such as handling missing values and one-hot encoding for categorical variables. Additionally, it explains how to read and save CSV files, as well as the differences between concatenating and joining DataFrames in Pandas.



from sklearn.preprocessing import StandardScaler, MinMaxScaler, Normalizer, normalize, OrdinalEncoder, OneHotEncoder
from sklearn.model_selection import train_test_split
Data Normalization
Reading csv in df
Simple data cleaning
L-p norm (normalize & Normalizer) for numeric data
StandardScaler
MinMax
OHE using get_dummies
OHE using OneHotEncoder class
Concat v/s join
Saving transformed df in csv for future use
Separating feature vector from Target label using df.iloc[row,col]
Train/Test split

Data Normalization
• Normalization is particularly useful for classification algorithms involving
neural networks or distance measurements such as nearest-neighbor
classification and clustering.
• If using the neural network backpropagation algorithm for classification,
normalizing the input values for each attribute measured in the training
tuples will help speed up the learning phase.
• For distance-based methods, normalization helps prevent attributes with
large ranges (e.g., income) from outweighing attributes with smaller
ranges (e.g., binary attributes).
• It is also useful when given no prior knowledge of the data.


LP-Norm
• For normalizing data using LP norms, first we need to calculate their
norms.
• The norm of a vector captures some property of the vector; e.g., the L2 norm of a vector denotes its length/magnitude.
• One important use of the L2 norm is to transform a given vector into a unit-length vector,
• i.e., making the magnitude of the vector = 1 while still preserving its direction.
• This is achieved by dividing each element of the vector by its length, i.e., its L2 norm.
• Another use of the L1 & L2 norms is in the computation of loss in regularized gradient descent algorithms.
• They are also used in the well-known Ridge and Lasso regression algorithms.

LP-Norm
• The p-norm is a norm on suitable real vector spaces given by the pth root of the sum (or integral) of the pth powers of the absolute values of the vector components.
• The general definition for the p-norm of a vector v that has N elements is:

$$\|v\|_p = \left( \sum_{i=1}^{N} |v_i|^p \right)^{1/p}$$

• https://www.journaldev.com/45324/norm-of-vector-python

where p is any positive real value, Inf, or -Inf. Some common values of p are 1, 2, and Inf.
• If p is 1, then the resulting 1-norm is the sum of the absolute values of the vector elements.
• If p is 2, then the resulting 2-norm gives the vector magnitude or Euclidean length of the vector.
• If p is Inf, then the resulting Inf-norm is the maximum of the absolute values of the vector elements.


Normalisation using LP norm

• Consider a vector x = (1, 2, 3, 4). Find the 1-norm and 2-norm of this vector, and normalize it using both LP norms.
• For the given vector: 1-norm = 10, 2-norm = sqrt(30) ≈ 5.477.
• The 1-norm-normalized vector is (1/10, 2/10, 3/10, 4/10).
• The entries of the L1-normalized vector sum to 1, so the 1-norm-normalized vector can be interpreted as a probability distribution.
• For normalization using the 2-norm, we get the vector (1/5.477, 2/5.477, 3/5.477, 4/5.477) = (0.1826, 0.3651, 0.5477, 0.7303).
• The 2-norm-normalized vector can be interpreted as the tip of a direction vector on the unit hypersphere, starting at the origin.
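A minimal sketch verifying this example with scikit-learn's normalize (note the 2-D input shape; one row = one vector):

import numpy as np
from sklearn.preprocessing import normalize

x = np.array([[1.0, 2.0, 3.0, 4.0]])   # 2-D shape: one row = one sample
print(normalize(x, norm='l1'))  # [[0.1 0.2 0.3 0.4]] -> entries sum to 1
print(normalize(x, norm='l2'))  # [[0.1826 0.3651 0.5477 0.7303]]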

Rescaling
• Rescaling changes the distance between the min and max values in a data
set by stretching or squeezing the points along the number line.

• The equation for rescaling data X to an arbitrary interval [a, b] is:

$$X' = a + \frac{(X - \min X)\,(b - a)}{\max X - \min X}$$


Example: Min-max normalization


• Min-max normalization: to [new_min_A, new_max_A]

$$v' = \frac{v - min_A}{max_A - min_A}\,(new\_max_A - new\_min_A) + new\_min_A$$

• E.g., let income range from $12,000 to $98,000, normalized to [0.0, 1.0]. Then $73,600 is mapped to

$$\frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}\,(1.0 - 0) + 0 = 0.716$$
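A quick check of this arithmetic in plain Python:

v, min_a, max_a = 73_600, 12_000, 98_000
new_min, new_max = 0.0, 1.0
print(round((v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min, 3))   # 0.716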
• Normalization by decimal scaling

$$v' = \frac{v}{10^{\,j}}$$  where j is the smallest integer such that Max(|v'|) < 1

Standard Normal Distribution


(or mean-centered distribution)
• A standard normal distribution has mean = 0 and standard deviation=1.

• Z-score normalization is useful in machine learning settings since it can tell you how far a data point is from the average of the whole data set.

• It can be most appropriate when there are just a few outliers, since it
provides a simple way to compare a data point to the norm.

• We calculate a z-score when comparing data sets that are likely to be similar because of some genetic or experimental reason, e.g.,
• a physical attribute of an animal, or
• results within a certain time frame.


Z-score
• z-scores measure the distance of a data point from the mean in
terms of the standard deviation.
• The standardized data set has mean=0 and standard deviation 1,
and retains the shape properties of the original data set.

Z-score
• The z-scores of the data are preserved, so the shape of the distribution remains the same.
• E.g. (μ: mean, σ: standard deviation of attribute A):

$$v' = \frac{v - \mu_A}{\sigma_A}$$

Ex. Let μ = 54,000, σ = 16,000. Then $73,600 is mapped to

$$\frac{73{,}600 - 54{,}000}{16{,}000} = 1.225$$
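The same kind of quick check for the z-score example:

v, mu, sigma = 73_600, 54_000, 16_000
print((v - mu) / sigma)   # 1.225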


normalize() method
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.normalize.html

Reading data from csv

import pandas as pd
df = pd.read_csv('cars.csv')
• use header=None if your csv doesn't have a header; header=0 if the first row has the column headers
• to assign column names while reading a csv with no column header, use names=['','',....]
df1=pd.read_csv("iris.data",header=None,names=['sepal_len','sepal_width','petal_len','petal_width','class'])
• use this for a csv with no header
df=pd.read_csv("test.csv", header=0) #default setting: first row is the col header
header=0 → specifies that the first row (index 0) should be used as the column headers. You can change it.
• saving an updated DataFrame back to csv
df_transf.to_csv("cars_transf.csv",index=False)
• Pandas DataFrames have an implicit index, which is often just a sequence of numbers (0, 1, 2, ...). If index=False is not set, Pandas will write this index as the first column in the CSV, which may not be needed.


Simple Data Cleaning


#simple data cleaning step; set axis=0 for dropping rows with NaN
df.dropna(axis=0,inplace=True)
nan_count_per_column2 = df.isna().sum() #NaN count per column
• By default, dropna() only removes rows containing NaN (None, np.nan) values.
• However, strings like '?', 'na', 'nil', ' ' are not automatically treated as NaN.
• To drop rows with dropna() for such specific strings (' ', '?', 'na', 'nil', etc.), first replace them with NaN:

import numpy as np

# Replace specific string values with NaN
df.replace(['?', 'na', 'nil', ' '], np.nan, inplace=True)
# Drop rows containing NaN in any column; default axis=0
df_cleaned = df.dropna()
# For column-wise drop of columns with NaN values
df.dropna(axis=1,inplace=True)


Demo of dropna()
- It returns only those rows with no occurrences of np.nan.

Demo of dropna(), column-wise.

Demo of fillna()
- It returns a dataframe with all np.nan replaced by a new value, here 'dummy'.
- Values can also be filled feature-wise by each column's mean value.
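A minimal sketch of this behaviour (the values and column names are illustrative):

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1.0, np.nan, 3.0], 'B': [np.nan, 5.0, 6.0]})
print(df.fillna('dummy'))    # every np.nan replaced by the string 'dummy'
print(df.fillna(df.mean()))  # each np.nan replaced feature-wise by its column mean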

Demo of dropna() and fillna()
- By default the inplace flag is set to 'False'.
- Hence all these fillna() or dropna() calls won't have any effect until you set the inplace flag to 'True'.
- After executing a number of fillna() and dropna() calls, if you display your data frame, you get that same old dataframe.
- You need to re-issue these statements again, but with the inplace flag set.


Demo with inplace=True

Demo of conditional selection and column-wise selection
- It returns a subset of df with only those rows for which feature ‘Z’ has values > 0, i.e., rows where the specified condition is true.
- It skips just one row, ‘D’, with a negative value of ‘Z’.
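A minimal sketch of both selections (a random df with rows 'A'..'E', as in the demo):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(5, 3), index=list('ABCDE'), columns=['X', 'Y', 'Z'])
print(df[df['Z'] > 0])   # keeps only rows where feature 'Z' is positive
print(df[['X', 'Y']])    # column-wise selection of a subset of features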


Normalization

Normalization of Numeric type features only

• Pre-requisite
• # Selecting only numeric columns
df_numeric = df.select_dtypes(include=['number'])

• For inline column-wise normalization using the normalize method:

from sklearn.preprocessing import normalize
numeric_norm1 = normalize(df_numeric, norm='max', axis=0)
#axis=0 normalizes data column-wise with the l1, l2, or max norm

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.normalize.html


Column wise normalization using normalize() method

Normalizer class
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Normalizer.html


Creating numpy array

Alternative methods for Feature/Column wise scaling
• For column/feature-wise scaling, StandardScaler() or MinMaxScaler() might be better options, as they do it by default.


Min-Max, Z-Score scaler (default: column wise)

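Both scalers work column-wise by default; a minimal sketch on a toy numeric df (the values are illustrative):

import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

df_num = pd.DataFrame({'income': [12_000, 54_000, 98_000], 'age': [20, 35, 50]})
print(MinMaxScaler().fit_transform(df_num))    # each column rescaled to [0, 1]
print(StandardScaler().fit_transform(df_num))  # each column to mean 0, std 1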


Using Scaler object on Test dataset using Transform

• Once these scaler objects are created and .fit() has been called on some data, the scaler learns simple statistical measures such as min, max, mean, median, standard deviation, etc. from the data.
• We can then invoke the .transform() method on unseen data to get it normalized using the trained scaler object.
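A minimal sketch (income values reused from the min-max example):

import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_tr = np.array([[12_000.], [54_000.], [98_000.]])
X_te = np.array([[73_600.]])

scaler = MinMaxScaler().fit(X_tr)   # learns min and max from the training data only
print(scaler.transform(X_te))       # [[0.716...]] -- unseen data scaled with the trained stats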

Normalizer (default: row-wise)


Max norm, row-wise

Using Normalizer class for column wise normalization


• Normalizer(norm='l2') works row-wise, meaning each row is treated as an independent vector and normalized using the L1, L2, or Max norm.
• To apply it column-wise, you must transpose the data, apply normalization, then transpose it back.
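A minimal sketch of the transpose trick (the values are illustrative):

import numpy as np
from sklearn.preprocessing import Normalizer

X = np.array([[1., 2.], [3., 4.], [5., 6.]])
col_norm = Normalizer(norm='l2').fit_transform(X.T).T  # transpose, normalize rows, transpose back
print(col_norm)   # each *column* now has unit L2 norm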



Column wise Max normalization


Comparison: LP norm, Normalizer class v/s normalize method


OHE using Pandas

• Separate out numeric columns and normalize them using any Lp norm/Z-score/MinMax etc.
• Separate out categorical columns in another DataFrame.
• Separate out ordinal columns in another DataFrame.

Handling Numeric Columns

• Notice the size and dtype of the output variable numeric_norm1.
• Its type is a NumPy array, so you need to typecast it into a DataFrame if you want to concatenate it with another df holding the outcome of OHE.
• We can do that conveniently with this statement:

df_numeric_norm1=pd.DataFrame(numeric_norm1,columns=['km_driven','selling_price'])


One Hot Encoding Using PANDAS

Using get_dummies() method directly on full DataFrame

#we can control data type of get_dummy resultant DF using dtype option
df_fu_ow=pd.get_dummies(df,columns=['fuel','owner'],dtype=np.int32,drop_first=True)
Now we will need to OHE ‘brand’ and drop a few redundant columns before concatenating it with normalized df_numeric


controlling data type of get_dummy resultant DF using dtype option

Method 2: Separating out Categorical columns

df_categorical = df.select_dtypes(include=['object', 'category'])

object → includes string-based categorical columns.
category → includes columns explicitly set as the Pandas categorical data type, which can be set while reading a csv with pd.read_csv (e.g., via its dtype parameter).


Check how many features your OHE df will have

df_fu_ow_brand=pd.get_dummies(df_categorical,dtype=np.int32,drop_first=True)
#we can control the data type of the resultant matrix using the dtype option
#drop_first=True drops the first column/nominal value to optimize the encoding of k nominal values by using just k-1 values

• You must have an idea of how many unique values a categorical feature holds, in order to know how many OHE columns it will generate (see the sketch below).
• If you use 'K-1' OHE outcomes with the option 'drop_first=True' for a column with 'K' unique values, it will add 'K-1' columns to the resulting DF for each such categorical column in your df.
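A minimal sketch for that count, assuming the df_categorical built in the previous step (the per-column figures in the comments follow from these slides' cars.csv numbers):

k = df_categorical.nunique()   # unique values per categorical column
print(k)                       # per these slides: fuel 4, owner 5, brand 32
print((k - 1).sum())           # OHE columns added with drop_first=True: 3 + 4 + 31 = 38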

Using get_dummies on categorical columns


Pandas
● Often the data you need exists in two separate sources; fortunately, Pandas makes it easy to combine these together.
● The simplest combination is when both sources are already in the same format; then a concatenation through the pd.concat() call is all that is needed.

Concatenating multiple df


Concatenating multiple DataFrames

• You can concatenate two Pandas DataFrames by columns using the pd.concat() function with axis=1.
df_combined = pd.concat([df1, df2], axis=1)
• Both df1 and df2 must have the same rows but may have different columns, to be put back to back.
• Handling Different Row Counts:
• If df1 and df2 have different numbers of rows, Pandas fills missing values with NaN. To reset the index, use df_combined.reset_index(drop=True).

More on pd.concat()
# Concatenating DataFrames row-wise (axis=0)
df_concat = pd.concat([df1, df2], ignore_index=True)
• ignore_index=True resets the index after stacking.
# Horizontal Concatenation (Merging Columns)
df_concat_cols = pd.concat([df1, df2], axis=1)
• Columns may get duplicated if they don't have unique labels.


df.join() v/s pd.concat()

df.join()
• Used for database-like joins.
• Best for: merging DataFrames based on index; works like an SQL join.
• Merges only on the index (default behavior).
• Good for column-wise concatenation of columns with the same index.
• Can specify different join types (left, right, inner, outer).
• Does not work well for stacking rows.
• E.g.
df_joined = df1.join(df2)
# Default is left join

pd.concat()
• Used for concatenation.
• Best for: stacking DataFrames (vertically or horizontally).
• Can combine multiple DataFrames along rows (axis=0) or columns (axis=1).
• Works on both row index and columns.
• Does not require a common index.
• Supports ignoring the index and reassigning a new one.

Demo of df.join()
The same output you get with this:
df_joined = pd.concat([df1, df2], axis=1)
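A minimal sketch of this equivalence (toy data):

import pandas as pd

df1 = pd.DataFrame({'A': [1, 2]}, index=[0, 1])
df2 = pd.DataFrame({'B': [3, 4]}, index=[0, 1])

print(df1.join(df2))                  # left join on the index
print(pd.concat([df1, df2], axis=1))  # same output when the indexes match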


Changing the dtype of the numeric_norm NumPy array to be concatenated with df_categorical_ohe

Concatenating categorical OHE df with normalized numeric df

Here the size of the resultant df is 8128 x 40; the first 38 columns are from the OHE df on 3 categorical fields, plus the 2 normalized numeric columns.


Saving the pre-processed and normalized df in csv, for opening it next time for a data analysis

Another method for OHE, for saving on dimensionality, i.e. reducing the count of features


Demo on ‘brand’ feature of Cars.csv

• If a particular feature (e.g. ‘brand’, with 32 unique values in cars.csv) has more than 10 unique values,
• it can be handled separately to save on dimensions in the resultant one-hot encoded df.


Demo on ‘brand’ feature of Cars.csv
Count of columns in the resultant df = 32 - 23 = 9.
You saved on the dimensionality of the data without compromising much on data quality.
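A minimal sketch of this idea on the ‘brand’ column (the cut-off of 8 kept brands and the 'Other' label are illustrative assumptions, not the slides' exact recipe):

import numpy as np
import pandas as pd

df = pd.read_csv('cars.csv')
top_brands = df['brand'].value_counts().nlargest(8).index                      # keep the 8 most frequent brands
df['brand'] = df['brand'].where(df['brand'].isin(top_brands), other='Other')   # lump the rare ones together
df_brand_ohe = pd.get_dummies(df['brand'], dtype=np.int32)                     # 8 kept brands + 'Other' -> 9 columns
print(df_brand_ohe.shape)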

Controlling the data type of the get_dummies resultant DF using the dtype option


OHE demo using OneHotEncoder class

• One-Hot Encoding (Nominal Categories)
• Best for: unordered categories (e.g., Colors, Cities, Product Types)
• One-hot encoding creates binary (0/1) columns for each category.
• drop='first' avoids the dummy variable trap.
• The sparse=False parameter was deprecated in sklearn.preprocessing.OneHotEncoder starting from Scikit-Learn v1.2; use sparse_output=False instead.

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Chicago', 'Houston']})

encoder = OneHotEncoder(sparse_output=False, drop='first')  # drop='first' to avoid dummy variable trap
encoded_data = encoder.fit_transform(df[['City']])

# Convert to DataFrame by typecasting
df_encoded = pd.DataFrame(encoded_data, columns=encoder.get_feature_names_out(['City']))

df = df.join(df_encoded)
print(df)


OHE demo


Ordinal Encoding Using Pandas

Consider this df with only ordinal columns


Using df['Ordinal_col_name'].cat.codes
• Identify ordinal columns manually or use a predefined dictionary.
• Convert them to categorical dtype with ordered categories.
• Apply ordinal encoding using .cat.codes.

• Education Encoding:
• High School → 0, Bachelor → 1, Master → 2, PhD → 3
• Satisfaction Encoding:
• Very Dissatisfied → 0, Dissatisfied → 1, Neutral → 2, Satisfied → 3, Very Satisfied → 4

• This approach ensures that the ordinal nature of the data is preserved.

Using df.cat.codes

import pandas as pd

# Sample DataFrame
df = pd.DataFrame({
    'Education': ['High School', 'Bachelor', 'Master', 'PhD', 'Bachelor', 'Master'],
    'Satisfaction': ['Neutral', 'Satisfied', 'Very Satisfied', 'Dissatisfied', 'Very Dissatisfied', 'Neutral']
})

# Define the order for ordinal encoding
education_order = pd.CategoricalDtype(categories=['High School', 'Bachelor', 'Master', 'PhD'], ordered=True)
satisfaction_order = pd.CategoricalDtype(categories=['Very Dissatisfied', 'Dissatisfied', 'Neutral', 'Satisfied', 'Very Satisfied'], ordered=True)

# Convert columns to categorical type


df['Education'] = df['Education'].astype(education_order)
df['Satisfaction'] = df['Satisfaction'].astype(satisfaction_order)

# Convert to numeric encoding


df['Education_Encoded'] = df['Education'].cat.codes
df['Satisfaction_Encoded'] = df['Satisfaction'].cat.codes

print(df)


Use OE method by assigning order

from sklearn.preprocessing import OrdinalEncoder

Alternate OE method using the OrdinalEncoder class


OrdinalEncoder class
• Best for: features with an inherent order (e.g., Education Level, Satisfaction).
• This assigns integer values such as 0, 1, 2, 3, ... based on a predefined order.

import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({'Education': ['High School', 'Bachelor', 'Master', 'PhD', 'Bachelor', 'Master']})

# Define ordinal categories
education_order = [['High School', 'Bachelor', 'Master', 'PhD']]

# Apply ordinal encoding
encoder = OrdinalEncoder(categories=education_order)
df['Education_Encoded'] = encoder.fit_transform(df[['Education']])
print(df)

OrdinalEncoder demo



Separating Feature vectors from label or target column

Y=df_transf.iloc[:,-1] # assuming the last column is the output label selling_price to be predicted for cars.csv

X=df_transf.iloc[:,0:-1] # same as 0:18 (index 18 excluded if your data has 18 feature columns)


Train/Test split for classification model

from sklearn.model_selection import train_test_split

X_train,X_test,y_train,y_test = train_test_split(X,Y,test_size=0.2,random_state=82, stratify=Y)

#test_size=0.2 separates 20% of data as the Test dataset

When to Use stratify?

• Classification problems where the target variable has imbalanced classes.
• When the dataset has rare classes that might be missing from train/test without stratification.
• To ensure better generalization for machine learning models.
• It ensures that all target label classes appear in the same proportion in the Test dataset as in the original dataset.

Option random_state
• The random_state parameter in Scikit-Learn ensures reproducibility
by controlling the randomness in various functions like
train_test_split(), KMeans(), RandomForestClassifier(), etc.
Why Use random_state?
• Reproducibility: Ensures that the results remain the same every time
you run the code.
• Consistency: Useful when comparing models to ensure they are
trained on the same Train/Test data split.
• Debugging: Helps track and debug issues by keeping the same
random selections.


Train/Test split

from sklearn.model_selection import train_test_split

X_train,X_test,y_train,y_test = train_test_split(X,Y,test_size=0.2,random_state=82)

#stratify is not needed (and not applicable) for regression problems,
#i.e. for a dataset with a continuous-valued Target label

