Roll No- 41463 (LP-3)
Email Classification
Classify the email using binary classification method. Email Spam detection has two
states: a) Normal State Not Spam b) Abnormal State Spam. Use K-Nearest Neighbors and
Support Vector Machine for Classification. Analyze their performance.
Dataset used: https://www.kaggle.com/datasets/balaka18/email-spam-classification-dataset-csv
(https://www.kaggle.com/datasets/balaka18/email-spam-classification-dataset-csv)
In [1]: import numpy as np
import pandas as pd
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.metrics import mean_squared_error,mean_absolute_error
from sklearn.metrics import accuracy_score
In [2]: df = pd.read_csv("emails.csv")
df.head()
Out[2]:
Email
the to ect and for of a you hou ... connevey jay valued lay infrastructu
No.
Email
0 0 0 1 0 0 0 2 0 0 ... 0 0 0 0
1
Email
1 8 13 24 6 6 2 102 1 27 ... 0 0 0 0
2
Email
2 0 0 1 0 0 0 8 0 0 ... 0 0 0 0
3
Email
3 0 5 22 0 5 1 51 2 10 ... 0 0 0 0
4
Email
4 7 6 17 1 5 2 57 0 9 ... 0 0 0 0
5
5 rows × 3002 columns
In [3]: df.tail()
Out[3]:
Email
the to ect and for of a you hou ... connevey jay valued lay infrastru
No.
Email
5167 2 2 2 3 0 0 32 0 0 ... 0 0 0 0
5168
Email
5168 35 27 11 2 6 5 151 4 3 ... 0 0 0 0
5169
Email
5169 0 0 1 1 0 0 11 0 0 ... 0 0 0 0
5170
Email
5170 2 7 1 0 2 1 28 2 0 ... 0 0 0 0
5171
Email
5171 22 24 5 1 6 5 148 8 2 ... 0 0 0 0
5172
5 rows × 3002 columns
In [4]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5172 entries, 0 to 5171
Columns: 3002 entries, Email No. to Prediction
dtypes: int64(3001), object(1)
memory usage: 118.5+ MB
In [5]: df.describe()
Out[5]:
the to ect and for of
count 5172.000000 5172.000000 5172.000000 5172.000000 5172.000000 5172.000000 5172.00000
mean 6.640565 6.188128 5.143852 3.075599 3.124710 2.627030 55.51740
std 11.745009 9.534576 14.101142 6.045970 4.680522 6.229845 87.57417
min 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.00000
25% 0.000000 1.000000 1.000000 0.000000 1.000000 0.000000 12.00000
50% 3.000000 3.000000 1.000000 1.000000 2.000000 1.000000 28.00000
75% 8.000000 7.000000 4.000000 3.000000 4.000000 2.000000 62.25000
max 210.000000 132.000000 344.000000 89.000000 47.000000 77.000000 1898.00000
8 rows × 3001 columns
In [6]: df.isnull().sum()
Out[6]: Email No. 0
the 0
to 0
ect 0
and 0
for 0
of 0
a 0
you 0
hou 0
in 0
on 0
is 0
this 0
enron 0
i 0
be 0
that 0
will 0
have 0
with 0
your 0
at 0
we 0
s 0
are 0
it 0
by 0
com 0
as 0
..
decisions 0
produced 0
ended 0
greatest 0
degree 0
solmonson 0
imbalances 0
fall 0
fear 0
hate 0
fight 0
reallocated 0
debt 0
reform 0
australia 0
plain 0
prompt 0
remains 0
ifhsc 0
enhancements 0
connevey 0
jay 0
valued 0
lay 0
infrastructure 0
military 0
allowing 0
ff 0
dry 0
Prediction 0
Length: 3002, dtype: int64
Splitting Train and Test dataset
In [7]: x = df.iloc[:,1:3001]
y = df.iloc[:,-1].values
In [8]: x_train, x_test, y_train, y_test = train_test_split(x,y, test_size=0.2,
a) Using K-Nearest Neighbours
In [9]: knn = KNeighborsClassifier(n_neighbors=8)
knn.fit(x_train, y_train)
y_pred = knn.predict(x_test)
In [ ]:
Analyzing performance
In [10]: print("MSE: ", mean_squared_error(y_test, y_pred))
print("MAE: ", mean_absolute_error(y_test, y_pred))
print("RMSE: ", np.sqrt(mean_squared_error(y_test, y_pred)))
print("R2 Score: ", metrics.r2_score(y_test, y_pred))
print("Accuracy Score for KNN: ", accuracy_score(y_test, y_pred))
MSE: 0.12560386473429952
MAE: 0.12560386473429952
RMSE: 0.3544063553807966
R2 Score: 0.40780091899790494
Accuracy Score for KNN: 0.8743961352657005
b) Using Support Vector Machine(SVM)
In [11]: svc = SVC(C=1.0, gamma='auto', kernel='rbf')
svc.fit(x_test, y_test)
y_pred = svc.predict(x_test)
Analyzing Performance
In [12]: print("MSE: ", mean_squared_error(y_test, y_pred))
print("MAE: ", mean_absolute_error(y_test, y_pred))
print("RMSE: ", np.sqrt(mean_squared_error(y_test, y_pred)))
print("R2 Score: ", metrics.r2_score(y_test, y_pred))
print("Accuracy Score for KNN: ", accuracy_score(y_test, y_pred))
MSE: 0.07149758454106281
MAE: 0.07149758454106281
RMSE: 0.2673903224521464
R2 Score: 0.6629020615834228
Accuracy Score for KNN: 0.9285024154589372
In [ ]: