ML PR 2-1

The document outlines an assignment for a Laboratory Practice course in Computer Engineering, focusing on email classification using binary classification methods, specifically K-Nearest Neighbors and Support Vector Machine algorithms. It provides a dataset description, objectives, prerequisites, and theoretical content related to data preprocessing, binary classification, and model training/testing procedures. Additionally, it includes assignment questions to assess understanding of the concepts discussed.

Uploaded by

Shrushti Patil


Department of Computer Engineering Course: Laboratory Practice-III

Answer   Coding Efficiency   Viva   Timely Completion   Total   Dated Sign of Subject Teacher
5        5                   5      5                   20

Expected Date of Completion:...................... Actual Date of Completion:......................

Group B

Assignment No: 2

Title of the Assignment: Classify emails using the binary classification method. Email
spam detection has two states:
a) Normal State – Not Spam,
b) Abnormal State – Spam.
Use K-Nearest Neighbors and Support Vector Machine for classification, and analyze
their performance.

Dataset Description: The CSV file contains 5172 rows, one per email, and 3002 columns.
The first column holds the email name; the names are numbers rather than recipients'
names, to protect privacy. The last column holds the label for prediction: 1 for spam,
0 for not spam. The remaining 3000 columns are the 3000 most common words across all
the emails, after excluding non-alphabetical characters/words. Each cell stores the
count of that word (column) in that email (row). Thus, information about all 5172
emails is stored in one compact dataframe rather than as separate text files.

Link: https://www.kaggle.com/datasets/balaka18/email-spam-classification-dataset-csv

Objective of the Assignment:

Students should be able to classify emails using binary classification and implement
email spam detection using the K-Nearest Neighbors and Support Vector Machine
algorithms.

Prerequisite:
1. Basic knowledge of Python
2. Concept of K-Nearest Neighbors and Support Vector Machine for classification.

PDEA College of Engineering Manjari

Contents of the Theory:

1. Data Preprocessing
2. Binary Classification
3. K-Nearest Neighbours
4. Support Vector Machine
5. Train, Test and Split Procedure

Data Preprocessing:
Data preprocessing is the process of preparing raw data and making it suitable for a machine
learning model. It is the first and a crucial step in creating a machine learning model.
In a machine learning project, we do not always come across clean, formatted data. Before
doing any operation with the data, it must be cleaned and put into a consistent format.
This is the job of data preprocessing.

Why do we need Data Preprocessing?

Real-world data generally contains noise and missing values, and may be in a format that
cannot be used directly by machine learning models. Data preprocessing cleans the data and
makes it suitable for a machine learning model, which can also improve the accuracy and
efficiency of the model.
It involves the steps below:

● Getting the dataset

● Importing libraries
● Importing datasets
● Finding Missing Data
● Encoding Categorical Data
● Splitting dataset into training and test set
● Feature scaling
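
The steps above can be sketched roughly as follows. This is a minimal illustration on a tiny made-up table, not the actual Kaggle file: the values are hypothetical, and the column names "Email No." and "Prediction" are assumptions about how the ID and label columns are named. The word-count features are already numeric, so no categorical encoding is needed in this sketch.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Tiny stand-in for the email dataset (hypothetical values): an ID column,
# a few word-count columns, and a 0/1 label in the last column.
df = pd.DataFrame({
    "Email No.": ["Email 1", "Email 2", "Email 3", "Email 4", "Email 5", "Email 6"],
    "the": [3, 0, 5, 1, np.nan, 2],
    "free": [0, 4, 0, 6, 5, 0],
    "meeting": [2, 0, 1, 0, 0, 3],
    "Prediction": [0, 1, 0, 1, 1, 0],
})

# Finding missing data: count missing cells per column, then fill them with 0.
missing_per_column = df.isnull().sum()
df = df.fillna(0)

# The ID column carries no signal, so it is dropped; the label is the target.
X = df.drop(columns=["Email No.", "Prediction"])
y = df["Prediction"]

# Splitting into training and test sets (stratified so both classes appear in each).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=2, random_state=0, stratify=y
)

# Feature scaling: fit the scaler on the training set only, then apply it to both sets.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```

Fitting the scaler on the training rows only keeps information from the test set out of the preprocessing step.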


Binary Classification

Binary classification refers to those classification tasks that have two class labels.

Examples include:

● Email spam detection (spam or not).

● Churn prediction (churn or not).
● Conversion prediction (buy or not).

Typically, binary classification tasks involve one class that is the normal state and another
class that is the abnormal state.

For example, “not spam” is the normal state and “spam” is the abnormal state. Similarly,
in a task that involves a medical test, “cancer not detected” is the normal state
and “cancer detected” is the abnormal state.

The class for the normal state is assigned the class label 0 and the class with the abnormal
state is assigned the class label 1.

It is common to model a binary classification task with a model that predicts a Bernoulli
probability distribution for each example.

The Bernoulli distribution is a discrete probability distribution that covers a case where an
event will have a binary outcome as either a 0 or 1. For classification, this means that the
model predicts a probability of an example belonging to class 1, or the abnormal state.
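
As a small illustration of predicting the probability of class 1, the sketch below fits scikit-learn's LogisticRegression on made-up one-feature data and converts the predicted probability into a label. The data and the 0.5 threshold are assumptions chosen for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: one feature (count of a "spammy" word), label 1 = spam.
X = np.array([[0], [1], [2], [3], [7], [8], [9], [10]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

model = LogisticRegression()
model.fit(X, y)

# predict_proba returns one row per example: [P(class 0), P(class 1)].
proba = model.predict_proba(np.array([[1], [9]]))
p_spam = proba[:, 1]

# Turn the Bernoulli probability of class 1 into a label using a 0.5 threshold.
labels = (p_spam >= 0.5).astype(int)
print(p_spam, labels)
```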

Popular algorithms that can be used for binary classification include:

● Logistic Regression
● k-Nearest Neighbors
● Decision Trees
● Support Vector Machine
● Naive Bayes


Some algorithms are specifically designed for binary classification and do not natively
support more than two classes; examples include Logistic Regression and Support Vector
Machines.

Support Vector Machine:

Support Vector Machine, or SVM, is one of the most popular supervised learning algorithms. It
is used for classification as well as regression problems, but primarily for classification
problems in machine learning.

The goal of the SVM algorithm is to create the best line or decision boundary that segregates
n-dimensional space into classes, so that a new data point can easily be put in the correct
category in the future. This best decision boundary is called a hyperplane.

SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme
cases are called support vectors, and hence the algorithm is termed Support Vector Machine.

[Figure: two different categories separated by a decision boundary (hyperplane)]
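
To make the idea of support vectors concrete, here is a minimal sketch using scikit-learn's SVC with a linear kernel on six made-up 2-D points. The points and the large C value (which approximates a hard margin) are assumptions for illustration only.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical 2-D points forming two well-separated groups.
X = np.array([[1, 1], [2, 1], [1, 2], [5, 5], [6, 5], [5, 6]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])

# A linear kernel gives a straight-line decision boundary w.x + b = 0.
clf = SVC(kernel="linear", C=1000.0)
clf.fit(X, y)

# The support vectors are the training points that lie closest to the boundary.
support_vectors = clf.support_vectors_
w = clf.coef_[0]        # normal vector of the hyperplane (linear kernel only)
b = clf.intercept_[0]   # offset of the hyperplane

# Classify two new points, one near each group.
prediction = clf.predict(np.array([[2.0, 2.0], [5.5, 5.5]]))
print(support_vectors, w, b, prediction)
```

Only the support vectors determine the boundary; removing any other training point would leave the fitted hyperplane unchanged.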


Example: SVM can be understood with the example used for the KNN classifier.
Suppose we see a strange cat that also has some features of dogs, and we want a model
that can accurately identify whether it is a cat or a dog. Such a model can be created
using the SVM algorithm. We first train the model with many images of cats and dogs
so that it can learn their different features, and then we test it with this strange
creature. The SVM creates a decision boundary between the two classes (cat and dog)
based on the extreme cases (support vectors) of each class. On the basis of the support
vectors, it will classify the creature as a cat.

[Figure: diagram of the cat/dog example]


SVM algorithm can be used for face detection, image classification, text categorization, etc.

Types of SVM

SVM can be of two types:

● Linear SVM: Linear SVM is used for linearly separable data. If a dataset can be
separated into two classes by a single straight line (more generally, a hyperplane), the
data is called linearly separable, and the classifier used is called a Linear SVM classifier.

● Non-linear SVM: Non-linear SVM is used for data that is not linearly separable. If a
dataset cannot be separated by a straight line, the data is called non-linear, and the
classifier used is called a Non-linear SVM classifier.
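
The difference between the two types can be seen in a short experiment. The sketch below is an illustration on synthetic data, not part of the assignment dataset: it uses scikit-learn's make_circles to generate two concentric rings, which no straight line can separate, and compares a linear kernel with the RBF kernel as one example of a non-linear SVM.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two concentric rings: a dataset that is not linearly separable.
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Linear SVM: can only draw a straight-line boundary.
linear_acc = SVC(kernel="linear").fit(X_train, y_train).score(X_test, y_test)

# Non-linear SVM: the RBF kernel allows a curved boundary.
rbf_acc = SVC(kernel="rbf").fit(X_train, y_train).score(X_test, y_test)

print("linear kernel accuracy:", linear_acc)
print("rbf kernel accuracy:", rbf_acc)
```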

Train, Test, Split Procedure:

Train test split is a model validation procedure that allows you to simulate how
a model would perform on new/unseen data. Here is how the procedure works:

1. ARRANGE THE DATA

Make sure your data is arranged into a format acceptable for train test split. In
scikit-learn, this consists of separating your full data set into “Features” and
“Target.”

2. SPLIT THE DATA

Split the data set into two pieces: a training set and a testing set. This
consists of randomly sampling, without replacement, about 75 percent of the rows
(you can vary this) and putting them into your training set. The remaining 25
percent is put into your test set. “Features” and “Target” are split along the
same rows, giving “X_train,” “X_test,” “y_train,” and “y_test” for a
particular train test split.

3. TRAIN THE MODEL

Train the model on the training set (“X_train” and “y_train”).

4. TEST THE MODEL

Test the model on the testing set (“X_test” and “y_test”) and
evaluate the performance.
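
The four steps above can be sketched as follows for the two classifiers named in the assignment. This is a minimal sketch under stated assumptions: the word-count table is simulated with random Poisson counts (hypothetical data standing in for the real CSV), and n_neighbors=5 and the linear kernel are illustrative choices, not values prescribed by the assignment.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic stand-in for the word-count table (hypothetical data).
rng = np.random.default_rng(42)
n_emails, n_words = 400, 20
y = rng.integers(0, 2, size=n_emails)                  # 1 = spam, 0 = not spam
lam = np.where(y[:, None] == 1, 4.0, 1.0) * np.ones(n_words)
X = rng.poisson(lam)                                   # word counts per email

# 1. Arrange the data: X holds the features, y holds the target (done above).

# 2. Split the data: 75 percent training, 25 percent testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# 3. Train each model on the training set, then
# 4. test it on the testing set and evaluate the performance.
models = {
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "SVM": SVC(kernel="linear"),
}
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    results[name] = accuracy_score(y_test, y_pred)
    print(name, "accuracy:", results[name])
    print(confusion_matrix(y_test, y_pred))
```

On the real dataset, the same loop would run on the 3000 word-count columns, and the accuracy and confusion matrix of the two models could then be compared directly.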

Conclusion:

In this way we have explored the concept of email spam detection using
binary classification.

Assignment Questions:

1. What is Binary Classification?

2. Explain the Support Vector Machine.
3. Explain the K-Nearest Neighbour algorithm for Machine Learning.
