
Report1 4 Sem New Final

The document presents a mini project titled 'Email Spam Filtering Based on Multinomial Naive Bayes Algorithm' submitted by students from Universal College of Engineering as part of their Bachelor of Engineering in Artificial Intelligence and Machine Learning. The project aims to develop an effective email spam detection system using machine learning techniques, specifically the Multinomial Naive Bayes algorithm, to classify emails as spam or not spam. It includes a comprehensive overview of the project's objectives, methodology, and contributions from each team member, along with a literature survey on existing spam detection methods.

Uploaded by

boranamohit872

Email Spam Filtering Based on Multinomial Naive Bayes Algorithm

Submitted in partial fulfillment of the requirements of the degree


BACHELOR OF ENGINEERING IN ARTIFICIAL
INTELLIGENCE AND MACHINE LEARNING ENGINEERING

BY

Mr. Jatin Mewada - 51
Mr. Aryan Bhujbal - 15
Mr. Yash Bhoir - 14
Mr. Mohit Borana - 17

Supervisor
Mr. Sushant Gawade

Department of Artificial Intelligence and Machine Learning Engineering
UNIVERSAL COLLEGE OF ENGINEERING, KAMAN.
University of Mumbai
(AY 2024-25)
Vidya Vikas Education Trust’s
Universal College of Engineering, Vasai (E)
Department of Artificial Intelligence and
Machine Learning Engineering

CERTIFICATE

This is to certify that the Mini Project entitled “Email Spam Filtering Based on Multinomial
Naive Bayes Algorithm” is the bonafide work of Mr. Jatin Mewada (51), Mr. Yash Bhoir
(14), Mr. Aaryan Bhujbal (15), and Mr. Mohit Borana (17).

This project has been submitted to the University of Mumbai in fulfillment of the
requirements for the Mini Project – Semester IV as part of the Second Year Artificial
Intelligence and Machine Learning Engineering program at Universal College of
Engineering, Vasai, Mumbai.

This work has been carried out under the Department of Artificial Intelligence and
Machine Learning Engineering during the academic year 2024-2025, Semester IV.

(Mr. Sushant Gawade)

Supervisor

(Mrs. Poonam Thakre)
Head of Department

(Dr. J. B. Patil)
Principal
Mini Project Approval

This Mini Project entitled “Email Spam Filtering Based on Multinomial Naive
Bayes Algorithm” by Mr. Jatin Mewada (51), Mr. Yash Bhoir (14), Mr. Aaryan Bhujbal
(15), and Mr. Mohit Borana (17) is approved for the degree of Bachelor of Engineering in
Artificial Intelligence and Machine Learning Engineering.

Examiners

1………………………………………
(Internal Examiner Name & Sign)

2…………………………………………
(External Examiner name & Sign)

Date:
Place:

Contents

Abstract
Acknowledgments
List of Figures
List of Symbols

1 Introduction
1.1 Introduction
1.2 Motivation
1.3 Problem Statement & Objectives

2 Literature Survey
2.1 Survey of Existing System
2.2 Limitation of Existing System or Research Gap
2.3 Mini Project Contribution

3 Proposed System
3.1 Proposed System
3.2 Architecture/Framework
3.3 Algorithm and Process Design
3.4 Conclusion and Future Work

4 Future Work

Conclusion
Appendix
Publication
References

Abstract
This project aims to build a simple yet effective email spam detection system
using machine learning techniques. The model is trained on a labeled dataset of emails to
classify messages as either spam or not spam. The raw text data is preprocessed by removing
punctuation, filtering out common stopwords, and converting the cleaned text into numerical
features using the CountVectorizer.

A Multinomial Naive Bayes algorithm is employed for training due to its suitability for text
classification tasks. The dataset is split into training and testing sets to evaluate model
performance. The trained model and vectorizer are saved using the pickle library for real-time
usage. A prediction function is also developed to classify new email messages and display
results with confidence levels.

The project demonstrates a complete pipeline from data loading and preprocessing to model
training and prediction, offering a lightweight and fast solution for basic spam email detection.

Acknowledgment
We take this opportunity to express our deep sense of gratitude to our project guide
and project coordinator, Mr. Sushant Gawade, for his continuous guidance and
encouragement throughout the duration of our project work. It is because of his experience
and knowledge that we could complete the project within the stipulated time. We would
also like to thank Mrs. Poonam Thakre, head of the computer engineering department, for
her encouragement, whole-hearted cooperation and support.

We would also like to thank our Principal Dr. J. B. Patil and the management of Universal
College of Engineering, Vasai, Mumbai for providing us all the facilities and the work friendly
environment. We acknowledge with thanks, the assistance provided by departmental staff,
library and lab attendants.

Mr. Jatin Mewada (51)

Mr. Yash Bhoir (14)

Mr. Mohit Borana (17)

Mr. Aaryan Bhujbal (15)

Chapter 1

Introduction

1.1 Introduction

Email spam classification is the process of identifying and filtering unwanted emails,
commonly known as spam. These emails often contain advertisements, phishing links, or
malicious attachments that pose security threats to users. With the increasing volume of emails
exchanged daily, spam filtering has become essential for maintaining efficient and secure
communication.

Traditional spam detection methods, such as rule-based filtering and blacklists, were once
used to block spam. However, spammers continuously modify their tactics to bypass these
static filters, making them less effective over time. This has led to the adoption of more
advanced techniques, particularly machine learning-based approaches.

Machine learning algorithms such as Multinomial Naïve Bayes, Support Vector Machines
(SVM), and deep learning models analyze various email features, including text patterns,
metadata, and sender reputation, to classify emails accurately. These models can adapt to new
spam patterns, improving detection accuracy compared to traditional methods.

Despite advancements in spam classification, challenges remain, including handling
imbalanced datasets, detecting sophisticated phishing emails, and reducing false positives.
Ongoing research focuses on enhancing spam filters using Natural Language Processing
(NLP) and deep learning to improve accuracy and adaptability against evolving spam threats.

1.2 Motivation
Email has become an essential communication tool for individuals, businesses, and
organizations. However, the increasing volume of spam emails disrupts productivity, clutters
inboxes, and consumes storage space. More critically, many spam emails contain phishing
links, malware, or fraudulent schemes that pose significant security risks. The need to filter
out these harmful emails while ensuring that legitimate messages are not blocked is a key
motivation behind email spam classification.
Traditional spam detection methods, such as rule-based filtering and blacklists, are becoming
less effective as spammers continuously evolve their techniques. Simple keyword-based
filters can be easily bypassed using misspellings, special characters, or misleading content.
This limitation has driven the need for more intelligent spam classification systems that can
learn from patterns and adapt to new spam strategies.
Machine learning and artificial intelligence (AI) have revolutionized email spam detection by
enabling automated and adaptive filtering. Algorithms such as Naïve Bayes, Support Vector
Machines (SVM), and deep learning models analyze email content, metadata, and sender
reputation to distinguish between spam and legitimate messages. These techniques
significantly improve accuracy and reduce the number of false positives and false negatives.
Research in this field continues to focus on developing robust models that can adapt to
evolving spam tactics while maintaining high efficiency.
An effective spam classification system benefits both individual users and organizations by
reducing security risks, improving email management, and enhancing user experience. As
spammers develop new evasion techniques, continuous advancements in spam detection
technology are essential to ensuring secure and reliable email communication.

1.3 Problem Statement & Objectives

Spam emails pose significant challenges to email users by cluttering inboxes, wasting storage
space, and exposing users to security threats such as phishing attacks, malware, and financial
fraud. Traditional rule-based spam filters struggle to adapt to evolving spam techniques,
leading to the need for more advanced and intelligent classification systems. The objective of
this study is to develop an effective email spam classification model using machine learning
and artificial intelligence techniques. This involves analyzing various email features,
improving detection accuracy, reducing false positives, and enhancing adaptability to new
spam patterns. Additionally, the study aims to explore Natural Language Processing (NLP)
and deep learning approaches to improve spam detection while ensuring minimal impact on
legitimate email communications.

Figure 1: Formula of Naive Bayes
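For reference, the standard form of Bayes' rule as applied by a Naive Bayes text classifier can be written as follows. The symbols are notation chosen for this sketch rather than taken from the figure: $c$ is the class (spam or ham), $w_i$ is the $i$-th word of the email, and $n$ is the number of words. The "naive" step is the assumption that words are conditionally independent given the class.

```latex
P(c \mid w_1, \dots, w_n) = \frac{P(c)\,\prod_{i=1}^{n} P(w_i \mid c)}{P(w_1, \dots, w_n)}
\qquad
\hat{c} = \arg\max_{c \in \{\text{spam},\,\text{ham}\}} \; P(c) \prod_{i=1}^{n} P(w_i \mid c)
```

The denominator is the same for both classes, so only the numerator needs to be compared when choosing $\hat{c}$.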

Chapter 2

Literature Survey

Email spam classification has been a widely researched area due to the increasing volume of
spam emails and the security threats they pose. Over the years, different techniques have been
developed to distinguish spam from legitimate emails. This survey examines various
approaches used in spam classification, highlighting their strengths and weaknesses.

1. Rule-Based and Heuristic Approaches

Simple Implementation: Rule-based filters, such as keyword detection and blacklists, are
straightforward to implement.

Immediate Effectiveness: These methods can quickly block known spam patterns.

Limited Adaptability: Static rules struggle to detect new and evolving spam tactics.

High False Positives: Legitimate emails may be mistakenly classified as spam due to
keyword-based filtering.

2. Naive Bayes Classifier

Efficient and Fast: Naïve Bayes is computationally lightweight and effective for text
classification.

Probabilistic Approach: It calculates the likelihood of an email being spam based on
word frequency.

Sensitivity to Feature Selection: The accuracy depends on selecting the right words
or phrases as classification features.

Easily Deceived: Spammers can bypass this method by modifying email content to
avoid common spam words.
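The probabilistic approach described above can be illustrated with a small from-scratch sketch. It is a toy illustration, not the project's implementation: the function names `train_nb` and `classify`, the four example messages, and the use of Laplace smoothing with `alpha=1.0` are all assumptions made for this example.

```python
import math
from collections import Counter

def train_nb(docs, labels, alpha=1.0):
    """Estimate log priors and Laplace-smoothed log word likelihoods per class."""
    vocab = {w for d in docs for w in d.split()}
    counts = {c: Counter() for c in set(labels)}
    for d, c in zip(docs, labels):
        counts[c].update(d.split())
    model = {}
    for c, wc in counts.items():
        total = sum(wc.values())
        log_prior = math.log(labels.count(c) / len(labels))
        log_lik = {
            w: math.log((wc[w] + alpha) / (total + alpha * len(vocab)))
            for w in vocab
        }
        model[c] = (log_prior, log_lik)
    return model

def classify(model, text):
    """Return the class with the highest log posterior; unseen words are skipped."""
    scores = {}
    for c, (log_prior, log_lik) in model.items():
        scores[c] = log_prior + sum(log_lik[w] for w in text.split() if w in log_lik)
    return max(scores, key=scores.get)

# Toy training data (hypothetical)
docs = ["win free prize", "free cash now", "meeting at noon", "project report due"]
labels = ["spam", "spam", "ham", "ham"]
model = train_nb(docs, labels)

print(classify(model, "free prize now"))   # words seen mostly in spam
print(classify(model, "project meeting"))  # words seen only in ham
```

Working in log space avoids numerical underflow when many small word probabilities are multiplied together.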

3. Support Vector Machines (SVM)

High Accuracy: SVMs perform well in high-dimensional spaces, improving spam
detection rates.

Better Generalization: Works effectively even with small datasets compared to probabilistic
models.

Computationally Expensive: Training SVM models can be slow, especially with large email
datasets.

Difficulty in Handling Noisy Data: SVMs struggle when spam and ham messages overlap.

2.1 Limitations of Existing Systems and Research Gaps
1. Adaptability to Evolving Spam Techniques

Challenge: Spammers continuously develop new techniques, such as obfuscation,
adversarial attacks, and evolving phishing strategies, which can bypass traditional
spam filters.

Research Gap: There is a need for adaptive learning models that can detect and
respond to emerging spam patterns without frequent manual updates.

2. High False Positives and False Negatives

Challenge: Many spam filters incorrectly classify legitimate emails as spam (false
positives) or fail to detect spam messages (false negatives), leading to user frustration
and security risks.

Research Gap: Research is needed to improve classification accuracy through
advanced hybrid models that reduce false positives while maintaining high spam
detection rates.

3. Phishing and Malicious Content Detection

Challenge: Many spam classifiers struggle to detect phishing emails and malicious
attachments, as these threats often mimic legitimate communication styles.

Research Gap: There is an opportunity for research into multi-layered detection
approaches that integrate email content analysis, link verification, and sender
reputation checks to improve phishing detection.

4. Scalability and Computational Efficiency

Challenge: Advanced machine learning and deep learning models require significant
computational resources, making real-time spam detection challenging for large-scale
email services.

Research Gap: Efficient, lightweight models that maintain high accuracy while
optimizing processing speed and resource usage need to be developed.

Figure 2: Step Wise Process of Spam Detection

2.2 Mini Project Contribution
This mini project on Email Spam Detection using Machine Learning was a
collaborative effort. Each member contributed to different phases of the project,
ensuring smooth development and successful completion. The following table outlines
the individual contributions:

Team Member | Contribution
Aryan Bhujbal | Collected and cleaned the dataset, implemented the text preprocessing functions (punctuation removal, stopword filtering), and handled vectorization using CountVectorizer. Also created the prediction function.
Jatin Mewada | Built and trained the Multinomial Naive Bayes model, performed model evaluation (accuracy, confusion matrix, classification report), and saved the model using pickle.
Yash Bhoir | Worked on data visualization using matplotlib and seaborn, created visual reports (spam frequency, heatmap), and handled dataset exploration and duplicate removal.
Mohit Borana | Developed the structure for the project documentation, wrote the abstract, conclusion, and future work, and designed the presentation slides.

Chapter 3
Proposed System

3.1 Proposed System

The system processes raw email data by cleaning and converting it into a machine-readable
format. It then uses a Multinomial Naive Bayes classifier, a probabilistic algorithm
well-suited for text classification problems, to learn patterns associated with spam and ham
(non-spam) emails. Once trained, the model can analyze new emails and predict their class
with a high degree of accuracy.

Key Features of the Proposed System

1. Automated Spam Detection
Automatically identifies spam emails using content-based analysis without the need for
manual filtering.

2. Efficient Preprocessing
Includes removal of stopwords, punctuation, and irrelevant content, helping the model focus
on meaningful words.

3. Bag-of-Words Model (CountVectorizer)
Transforms email text into numerical vectors based on word frequency for machine learning
compatibility.

4. Machine Learning Classification
Uses the Multinomial Naive Bayes algorithm, which is fast, efficient, and performs well with
high-dimensional text data.

5. Model Persistence
The trained model and vectorizer are saved using pickle, making the system reusable and
efficient for real-world deployment.

6. User-Friendly Interface (CLI or Web)
The system includes a sample prediction function and can easily be extended to a
command-line interface or web app for ease of use.

Figure 3: Proposed System of Spam Detection


3.1.1 Project Module

1. Data Collection Module


Objective: To acquire a dataset of email messages labeled as spam or not spam.

Description: This module involves gathering or importing an existing email dataset (e.g.,
emails.csv) that contains two main columns: the email text and the label (spam = 1 or 0).

Tools/Tech: pandas, Google Drive (for data storage in Colab)

Functionality: Load dataset using pandas.read_csv()

Display basic information, stats, and structure
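The loading step can be sketched as follows. This is a minimal illustration under stated assumptions: the inline CSV stands in for emails.csv, and the column names `text` and `spam` are hypothetical since the report does not name them.

```python
import io
import pandas as pd

# Hypothetical stand-in for emails.csv; the column names "text" and "spam" are assumed.
csv_data = io.StringIO(
    "text,spam\n"
    "Win a free prize now,1\n"
    "Meeting moved to Monday,0\n"
    "Win a free prize now,1\n"
)

df = pd.read_csv(csv_data)         # with a real file: pd.read_csv("emails.csv")
print(df.shape)                    # number of rows and columns
print(df["spam"].value_counts())   # class balance (spam = 1, not spam = 0)

df = df.drop_duplicates()          # duplicate removal, as done during dataset exploration
print(df.shape)
```

Checking the class balance at this stage is useful because a heavily skewed dataset affects how accuracy should be interpreted later.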

2. Data Preprocessing Module


Objective: To clean and prepare the email text data for modeling.

Description: Raw email data is full of noise such as punctuation, special characters, and
stopwords. This module processes the text to convert it into a machine-understandable format.

Steps Involved: Remove punctuation and lowercase all words

Eliminate stopwords using NLTK

Tokenize and join the clean words

Tools/Tech: string, nltk, stopwords, custom process() function
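A minimal sketch of what such a `process()` function might look like is given below. The body is an assumption, since the report only names the function and its steps. To keep the example self-contained, a small hand-written stopword set is used in place of NLTK's English list.

```python
import string

# Small illustrative stopword set (an assumption for this sketch). The report uses NLTK,
# where the list would come from nltk.corpus.stopwords.words("english")
# after running nltk.download("stopwords").
STOPWORDS = {"a", "an", "the", "is", "to", "for", "of", "and", "you", "your"}

def process(text):
    """Strip punctuation, lowercase, and drop stopwords; returns a list of tokens."""
    no_punct = "".join(ch for ch in text if ch not in string.punctuation)
    return [w for w in no_punct.lower().split() if w not in STOPWORDS]

tokens = process("Congratulations!!! You have WON a free prize, claim it now.")
print(tokens)
print(" ".join(tokens))  # joined form, if a cleaned string is needed instead of a list
```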

3. Feature Extraction Module


Objective: To transform the cleaned email text into numerical feature vectors.

Description: Text data is converted into a numeric matrix using the CountVectorizer technique
which helps in representing email text as word frequency counts.

Functionality:

Initialize CountVectorizer with custom analyzer

Fit and transform the data into a sparse matrix

Save the vectorizer for future predictions

Tools/Tech: sklearn.feature_extraction.text.CountVectorizer, pickle
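The vectorization step might look like the sketch below. The `process` function here is a simplified stand-in for the project's analyzer, and the two example emails are made up for illustration.

```python
import string
from sklearn.feature_extraction.text import CountVectorizer

def process(text):
    """Simplified analyzer (assumed): remove punctuation, lowercase, split on whitespace."""
    cleaned = "".join(ch for ch in text if ch not in string.punctuation)
    return cleaned.lower().split()

emails = ["Free prize! Claim your free prize now.", "Project meeting on Monday."]

# A callable analyzer replaces CountVectorizer's built-in preprocessing and tokenizing.
vectorizer = CountVectorizer(analyzer=process)
X = vectorizer.fit_transform(emails)   # sparse document-term matrix of word counts

free_col = vectorizer.vocabulary_["free"]   # column index assigned to the word "free"
print(X.shape)           # (number of emails, vocabulary size)
print(X[0, free_col])    # how many times "free" occurs in the first email
```

The fitted vectorizer can be written out with `pickle.dump`. Because pickle stores a custom analyzer function by reference, the same `process` function has to be importable wherever the file is loaded later.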

4. Model Training Module


Objective: To train a machine learning model using the preprocessed dataset.

Description: A Multinomial Naive Bayes classifier is trained using the feature vectors to classify emails as
spam or not spam.

Steps Involved:

Split dataset into training and testing sets (80/20)

Train the model on training data

Save the trained model

Tools/Tech: sklearn.naive_bayes.MultinomialNB, train_test_split, pickle
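The training steps can be sketched on toy data as follows. The ten messages, the `random_state` and `stratify` arguments, and the file name mentioned in the comment are assumptions added for illustration and reproducibility. They are not details from the report.

```python
import pickle
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Toy data for illustration only; 1 = spam, 0 = not spam.
texts = [
    "win free prize now", "free cash offer", "claim your prize",
    "cheap loans click here", "urgent offer win cash",
    "meeting on monday", "project report attached", "lunch at noon",
    "see you tomorrow", "notes from the lecture",
]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

X = CountVectorizer().fit_transform(texts)

# 80/20 split; random_state and stratify are assumptions added here.
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.2, random_state=42, stratify=labels
)

model = MultinomialNB()        # default additive (Laplace) smoothing, alpha=1.0
model.fit(X_train, y_train)

# Serialize in memory; to write a file instead (hypothetical name):
#   with open("model.pkl", "wb") as f: pickle.dump(model, f)
model_bytes = pickle.dumps(model)
print(X_train.shape[0], X_test.shape[0])
```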

5. Model Evaluation Module


Objective: To evaluate the performance of the trained model.

Description: The accuracy, precision, recall, F1-score, and confusion matrix are computed to evaluate
model effectiveness.
Functionality:

Make predictions on test set

Generate accuracy score, classification report

Plot confusion matrix as heatmap

Tools/Tech: sklearn.metrics, seaborn, matplotlib
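The evaluation calls can be sketched as below. The label and prediction lists are hypothetical values chosen to make the arithmetic easy to follow. They are not results from this project.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Hypothetical test labels and predictions (1 = spam, 0 = not spam).
y_test = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

acc = accuracy_score(y_test, y_pred)     # fraction of correct predictions
cm = confusion_matrix(y_test, y_pred)    # rows: actual class, columns: predicted class

print(acc)
print(cm)
print(classification_report(y_test, y_pred, target_names=["ham", "spam"]))

# To draw the confusion matrix as a heatmap (requires seaborn and matplotlib):
#   import seaborn as sns; sns.heatmap(cm, annot=True, fmt="d")
```

In this toy case six of eight predictions are correct, with one false positive and one false negative, so accuracy, precision, and recall for the spam class all come out to 0.75.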

6. Prediction Module
Objective: To predict whether new/unseen emails are spam or not.

Description: This module loads the saved model and vectorizer, processes the input email, and makes a
prediction.

Functionality:

Load model and vectorizer using pickle

Vectorize new input

Return prediction label and confidence score

Tools/Tech: pickle, MultinomialNB.predict_proba()
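A possible shape for the prediction function is sketched below. Several things are assumptions: the function name `predict_email`, the toy training data, and the use of `CountVectorizer` with its default tokenizer rather than the project's custom analyzer, which keeps the pickled objects self-contained. The in-memory pickle round trip stands in for saving to and loading from disk.

```python
import pickle
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Train a toy model so the sketch is self-contained (1 = spam, 0 = ham).
texts = [
    "win free prize now", "free cash offer click now", "claim your free prize",
    "meeting schedule for monday", "please review the project report",
    "lunch at noon tomorrow",
]
labels = [1, 1, 1, 0, 0, 0]
vec = CountVectorizer().fit(texts)
clf = MultinomialNB().fit(vec.transform(texts), labels)

# In-memory round trip standing in for saving to and loading from disk.
saved = pickle.dumps((vec, clf))
vectorizer, model = pickle.loads(saved)

def predict_email(text):
    """Return (label, confidence) for one email; the function name is illustrative."""
    probs = model.predict_proba(vectorizer.transform([text]))[0]
    best = probs.argmax()
    label = "spam" if model.classes_[best] == 1 else "ham"
    return label, float(probs[best])

print(predict_email("free prize offer"))
print(predict_email("project meeting monday"))
```

The confidence reported here is the posterior probability of the winning class. Naive Bayes probabilities tend to be overconfident, so they are better read as a ranking signal than as calibrated likelihoods.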

7. User Interface / Integration Module (Optional)


Objective: To build an interface (web or CLI) for users to interact with the system.

Future Scope: Can be implemented using Flask, Streamlit, or a basic Python UI using tkinter.

3.2 Algorithm and Process Design

The following section describes the algorithm and process design details.

3.2.1 Algorithm

1. Collect Data.

2. Preprocess Emails

3. Feature Extraction

4. Train a Machine Learning Model

5. Test and Evaluate the Model

6. Deploy and Classify Emails

7. Update Model Regularly

3.2.2 Process Design

1. User Authentication: Secure login and session management.

2. Product Search: Keyword-based product retrieval.

3. Recommendation System: Personalized product suggestions.

Figure 4: Process of Spam Detecting During Message Flow

Chapter 4

Future Work

1. Better Text Processing
Use more advanced techniques, such as lemmatization or TF-IDF weighting in addition to the
current stopword removal, to improve the quality of the input features.

2. Try Other Models
We can try using other machine learning models like SVM or Random Forest, and even
combine them with Naïve Bayes for better accuracy.

3. Use Deep Learning
Deep learning models like LSTM or BERT can be tested to understand the meaning of emails
more deeply.

4. Handle Imbalanced Data
If there are more ham emails than spam, we can balance the data to improve results.

5. Real-Time Detection
Improve the system so that it can detect spam in real time when new emails arrive.

6. Support for More Languages
The current model works mainly for English. It can be improved to detect spam in other
languages too.

7. Detect Image or File Spam
Some spam emails have images or attachments. We can work on detecting those as well.
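As a rough illustration of the first item, TF-IDF weighting could be swapped in for raw counts with little change to the rest of the pipeline. The sketch below is an assumption about how that might be done with scikit-learn, using toy data. It is not something implemented in this project.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy data for illustration only; 1 = spam, 0 = ham.
texts = [
    "win free prize now", "free cash offer click now", "claim your free prize",
    "meeting schedule for monday", "please review the project report",
    "lunch at noon tomorrow",
]
labels = [1, 1, 1, 0, 0, 0]

# TF-IDF down-weights words that occur in many emails; MultinomialNB also accepts
# these fractional weights in place of integer counts.
pipeline = make_pipeline(TfidfVectorizer(stop_words="english"), MultinomialNB())
pipeline.fit(texts, labels)

prediction = pipeline.predict(["free prize offer"])
print(prediction)
```

Wrapping the vectorizer and classifier in one pipeline object also means a single pickle file would hold both, which simplifies saving and loading.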

Conclusion

In this project, we successfully implemented a spam detection system using fundamental
machine learning techniques and simple text preprocessing methods. The system effectively
classifies email messages into spam and non-spam categories using a Multinomial Naive
Bayes model, which is known for its efficiency and accuracy in text-based classification
problems. By leveraging basic preprocessing techniques, such as removing punctuation,
filtering out stopwords, and converting text data into numerical format using
CountVectorizer, we created a reliable feature set suitable for model training.

The performance of the model was evaluated using a variety of metrics including accuracy
score, confusion matrix, and classification report, all of which indicated strong predictive
capabilities. Additionally, we ensured the reusability of our model by saving both the trained
model and vectorizer using the pickle library, allowing real-time spam prediction for new
emails. A custom prediction function was also developed to demonstrate practical usage of
the model.

This end-to-end system, from data preprocessing and training to prediction, shows that even
with fundamental tools and techniques it is possible to build a functional and effective
spam detection system. The project serves as a solid foundation for understanding how
machine learning can be applied in practical, real-life applications.

Overall, this project provides a strong foundation in understanding text classification
workflows, model building, evaluation, and deployment, and it emphasizes how even
straightforward approaches can lead to meaningful and impactful results.

Appendix
1) PYTHON
Python is a high-level, interpreted programming language created by Guido van
Rossum and first released in 1991. It emphasizes code readability with its use of
significant indentation. Python supports multiple programming paradigms,
including procedural, object-oriented, and functional programming. It has a vast
standard library, making it suitable for a wide range of applications, from web
development to data science, automation, and artificial intelligence. Python is open-
source, managed by the Python Software Foundation, and has a large, active
community contributing to its ongoing development and support.

1. Data Handling & Processing

pickle – Serializes and deserializes Python objects.

numpy – Provides support for large, multi-dimensional arrays and mathematical functions.

pandas – Offers data manipulation and analysis tools.

2. Data Visualization

seaborn – Enhances data visualization with attractive statistical graphics.

matplotlib.pyplot – Enables plotting and visualization of data.

3. Text Processing & Machine Learning

string – Provides string manipulation functions.

sklearn.feature_extraction.text.CountVectorizer – Converts text data into numerical format.

sklearn.model_selection.train_test_split – Splits data into training and testing sets.

sklearn.naive_bayes.MultinomialNB – Implements the multinomial Naïve Bayes classifier for
text classification.

sklearn.metrics – Provides performance evaluation metrics like accuracy.


SCHEDULE FOR MINI PROJECT

Date | Week | Content | Guide Sign | Remark
 | 1 | Formation of mini project groups | |
 | 2 | Submission of 3 titles of mini project from each group | |
 | 3 | Presentation of each group in front of the panel to finalize the mini project title from 3 different titles | |
 | 4 | Allocation of guide for every mini project group as per the domain expertise | |
 | 5 | Literature survey on finalized topics and planning | |
 | 6 | Making of PPTs with respect to finalized titles | |
 | 7 | Report to allocated guide and finalize the PPT with proper detail as suggested by guide | |
 | 8 | Second presentation of mini project with all details in front of panel | |
 | 9 | Report to guide | |
 | 10 | Report to guide | |
 | 11 | Complete all work and report to guide | |
 | 12 | Final presentation in front of external examiner at the time of university exam | |
EXAMINER'S FEEDBACK FORM

Name of External examiner: ___________________________


College of External examiner: ___________________________
Name of Internal examiner: ___________________________
Date of Examination: / /
No. of students in project team:
Availability of separate lab for the project: Yes / No

Student Performance Analysis (Put Tick as per your Observation)

Sr. No. | Observation | Excellent (3) | Very Good (2) | Good (1)
1 | Quality of problem and clarity | | |
2 | Innovativeness in solutions | | |
3 | Cost effectiveness and societal impact | | |
4 | Full functioning of working model as per stated requirements | | |
5 | Effective use of skill sets | | |
6 | Effective use of standard engineering norms | | |
7 | Contribution of an individual as member or leader | | |
8 | Clarity in written and oral communication | | |
9 | Overall performance | | |

o Can the same mini project extend to next semester by adding new objectives/ideas? (Yes/ No)
o If yes, suggest new Innovative Technique/Idea/ objectives related to this project.

Signature of External Examiner Signature of Internal Examiner
