Email Spam Filtering Based on Multinomial Naive
Bayes Algorithm
Submitted in partial fulfillment of the requirements of the degree
BACHELOR OF ENGINEERING IN ARTIFICIAL
INTELLIGENCE AND MACHINE LEARNING ENGINEERING
BY
Mr. Jatin Mewada - 51
Mr. Aryan Bhujbal - 15
Mr. Yash Bhoir - 14
Mr. Mohit Borana - 17
Supervisor
Mr. Sushant Gawade
Department of Artificial Intelligence and Machine
Learning Engineering
UNIVERSAL COLLEGE OF ENGINEERING, KAMAN.
University of Mumbai
(AY 2024-25)
Vidya Vikas Education Trust’s
Universal College of Engineering, Vasai (E)
Department of Artificial Intelligence and
Machine Learning Engineering
CERTIFICATE
This is to certify that the Mini Project entitled “Email Spam Filtering Based on Multinomial
Naive Bayes Algorithm” is the bonafide work of Mr. Jatin Mewada (51), Mr. Yash Bhoir
(14), Mr. Aaryan Bhujbal (15), and Mr. Mohit Borana (17).
This project has been submitted to the University of Mumbai in fulfillment of the
requirements for the Mini Project – Semester IV as part of the Second Year Artificial
Intelligence and Machine Learning Engineering program at Universal College of
Engineering, Vasai, Mumbai.
This work has been carried out under the Department of Artificial Intelligence and
Machine Learning Engineering during the academic year 2024-2025, Semester IV.
(Mr. Sushant Gawade)
Supervisor
(Mrs. Poonam Thakre) (Dr. J.B.Patil)
Head of Department Principal
Mini Project Approval
This Mini Project entitled “Email Spam Filtering Based on Multinomial Naive
Bayes Algorithm” by Mr. Jatin Mewada (51), Mr. Yash Bhoir (14), Mr. Aaryan Bhujbal
(15), and Mr. Mohit Borana (17) is approved for the degree of Bachelor of Engineering in
Artificial Intelligence and Machine Learning Engineering.
Examiners
1………………………………………
(Internal Examiner Name & Sign)
2…………………………………………
(External Examiner name & Sign)
Date:
Place:
Contents
Abstract
Acknowledgments
List of Figures
List of Symbols
1 Introduction
1.1 Introduction
1.2 Motivation
1.3 Problem Statement & Objectives
2 Literature Survey
2.1 Survey of Existing System
2.2 Limitation of Existing System or Research Gap
2.3 Mini Project Contribution
3 Proposed System
3.1 Proposed System
3.2 Architecture/Framework
3.3 Algorithm and Process Design
3.4 Conclusion and Future Work
4 Future Work
Conclusion
Appendix
Publication
References
Abstract
This project aims to build a simple yet effective email spam detection system
using machine learning techniques. The model is trained on a labeled dataset of emails to
classify messages as either spam or not spam. The raw text data is preprocessed by removing
punctuation, filtering out common stopwords, and converting the cleaned text into numerical
features using the CountVectorizer.
A Multinomial Naive Bayes algorithm is employed for training due to its suitability for text
classification tasks. The dataset is split into training and testing sets to evaluate model
performance. The trained model and vectorizer are saved using the pickle library for real-time
usage. A prediction function is also developed to classify new email messages and display
results with confidence levels.
The project demonstrates a complete pipeline from data loading and preprocessing to model
training and prediction, offering a lightweight and fast solution for basic spam email detection.
Acknowledgment
We take this opportunity to express our deep sense of gratitude to our project guide
and project coordinator, Mr. Sushant Gawade, for his continuous guidance and
encouragement throughout the duration of our project work. It is because of his experience
and knowledge that we could complete the project within the stipulated time. We would
also like to thank Mrs. Poonam Thakre, Head of Department, for her encouragement,
whole-hearted cooperation and support.
We would also like to thank our Principal Dr. J. B. Patil and the management of Universal
College of Engineering, Vasai, Mumbai for providing us all the facilities and the work friendly
environment. We acknowledge with thanks, the assistance provided by departmental staff,
library and lab attendants.
Mr. Jatin Mewada (51)
Mr. Yash Bhoir (14)
Mr. Mohit Borana (17)
Mr. Aaryan Bhujbal (15)
Chapter 1
Introduction
1.1 Introduction
Email spam classification is the process of identifying and filtering unwanted emails,
commonly known as spam. These emails often contain advertisements, phishing links, or
malicious attachments that pose security threats to users. With the increasing volume of emails
exchanged daily, spam filtering has become essential for maintaining efficient and secure
communication.
Traditional spam detection methods, such as rule-based filtering and blacklists, were once
used to block spam. However, spammers continuously modify their tactics to bypass these
static filters, making them less effective over time. This has led to the adoption of more
advanced techniques, particularly machine learning-based approaches.
Machine learning algorithms such as Multinomial Naïve Bayes, Support Vector Machines
(SVM), and deep learning models analyze various email features, including text patterns,
metadata, and sender reputation, to classify emails accurately. These models can adapt to new
spam patterns, improving detection accuracy compared to traditional methods.
Despite advancements in spam classification, challenges remain, including handling
imbalanced datasets, detecting sophisticated phishing emails, and reducing false positives.
Ongoing research focuses on enhancing spam filters using Natural Language Processing
(NLP) and deep learning to improve accuracy and adaptability against evolving spam threats.
1.2 Motivation
Email has become an essential communication tool for individuals, businesses, and
organizations. However, the increasing volume of spam emails disrupts productivity, clutters
inboxes, and consumes storage space. More critically, many spam emails contain phishing
links, malware, or fraudulent schemes that pose significant security risks. The need to filter
out these harmful emails while ensuring that legitimate messages are not blocked is a key
motivation behind email spam classification.
Traditional spam detection methods, such as rule-based filtering and blacklists, are becoming
less effective as spammers continuously evolve their techniques. Simple keyword-based
filters can be easily bypassed using misspellings, special characters, or misleading content.
This limitation has driven the need for more intelligent spam classification systems that can
learn from patterns and adapt to new spam strategies.
Machine learning and artificial intelligence (AI) have revolutionized email spam detection by
enabling automated and adaptive filtering. Algorithms such as Naïve Bayes, Support Vector
Machines (SVM), and deep learning models analyze email content, metadata, and sender
reputation to distinguish between spam and legitimate messages. These techniques
significantly improve accuracy and reduce the number of false positives and false negatives.
Research in this field continues to focus on developing robust models that can adapt to
evolving spam tactics while maintaining high efficiency.
An effective spam classification system benefits both individual users and organizations by
reducing security risks, improving email management, and enhancing user experience. As
spammers develop new evasion techniques, continuous advancements in spam detection
technology are essential to ensuring secure and reliable email communication.
1.3 Problem Statement & Objectives
Spam emails pose significant challenges to email users by cluttering inboxes, wasting storage
space, and exposing users to security threats such as phishing attacks, malware, and financial
fraud. Traditional rule-based spam filters struggle to adapt to evolving spam techniques,
leading to the need for more advanced and intelligent classification systems. The objective of
this study is to develop an effective email spam classification model using machine learning
and artificial intelligence techniques. This involves analyzing various email features,
improving detection accuracy, reducing false positives, and enhancing adaptability to new
spam patterns. Additionally, the study aims to explore Natural Language Processing (NLP)
and deep learning approaches to improve spam detection while ensuring minimal impact on
legitimate email communications.
Figure 1: Formula of Naive Bayes
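For reference, the standard form of Bayes' theorem on which the Naive Bayes classifier rests can be written as below. The notation (c for the class, w_i for the words of an email, alpha for the smoothing parameter) is assumed here for illustration and is not taken from the report's figure.

```latex
% Bayes' theorem for a class c (spam or ham) and an email x:
P(c \mid x) = \frac{P(x \mid c)\, P(c)}{P(x)}

% Naive (conditional independence) assumption over the words w_1, ..., w_n of x:
P(c \mid w_1, \dots, w_n) \;\propto\; P(c) \prod_{i=1}^{n} P(w_i \mid c)

% Smoothed word likelihood in the multinomial model:
% N_{w,c} = count of word w in class c, N_c = total word count in class c,
% |V| = vocabulary size, \alpha = smoothing parameter (\alpha = 1 is Laplace smoothing).
P(w \mid c) = \frac{N_{w,c} + \alpha}{N_c + \alpha\,|V|}
```

The email is assigned to the class with the larger posterior; the denominator P(x) is the same for both classes and can be ignored when comparing them.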
Chapter 2
Literature Survey
Email spam classification has been a widely researched area due to the increasing volume of
spam emails and the security threats they pose. Over the years, different techniques have been
developed to distinguish spam from legitimate emails. This survey examines various
approaches used in spam classification, highlighting their strengths and weaknesses.
1. Rule-Based and Heuristic Approaches
Simple Implementation: Rule-based filters, such as keyword detection and blacklists,
are straightforward to implement.
Immediate Effectiveness: These methods can quickly block known spam patterns.
Limited Adaptability: Static rules struggle to detect new and evolving spam tactics.
High False Positives: Legitimate emails may be mistakenly classified as spam due to
keyword-based filtering.
2. Naive Bayes Classifier
Efficient and Fast: Naïve Bayes is computationally lightweight and effective for text
classification.
Probabilistic Approach: It calculates the likelihood of an email being spam based on
word frequency.
Sensitivity to Feature Selection: The accuracy depends on selecting the right words
or phrases as classification features.
Easily Deceived: Spammers can bypass this method by modifying email content to
avoid common spam words.
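The probabilistic approach described above can be illustrated with a small hand computation. This is a minimal sketch in plain Python, assuming a made-up four-message training set and Laplace smoothing (alpha = 1); the messages and the function names are hypothetical and are not taken from the project code.

```python
import math
from collections import Counter

# Toy training data (hypothetical, for illustration only).
spam_docs = ["win free prize now", "free money now"]
ham_docs = ["meeting schedule for monday", "project report attached now"]

def word_counts(docs):
    """Count how often each word appears across a list of documents."""
    counts = Counter()
    for doc in docs:
        counts.update(doc.split())
    return counts

spam_counts = word_counts(spam_docs)
ham_counts = word_counts(ham_docs)
vocabulary = set(spam_counts) | set(ham_counts)

def log_score(message, counts, prior, alpha=1.0):
    """Log of the class prior times the smoothed word likelihoods."""
    total = sum(counts.values())
    score = math.log(prior)
    for word in message.split():
        if word not in vocabulary:
            continue  # ignore words never seen in training
        likelihood = (counts[word] + alpha) / (total + alpha * len(vocabulary))
        score += math.log(likelihood)
    return score

def spam_probability(message):
    """Posterior probability that a message is spam under the toy model."""
    n_spam, n_ham = len(spam_docs), len(ham_docs)
    log_spam = log_score(message, spam_counts, n_spam / (n_spam + n_ham))
    log_ham = log_score(message, ham_counts, n_ham / (n_spam + n_ham))
    # Normalise the two joint scores so that they sum to 1.
    return 1.0 / (1.0 + math.exp(log_ham - log_spam))

print(round(spam_probability("free prize"), 3))
print(round(spam_probability("project meeting"), 3))
```

Words that occur more often in spam than in ham push the probability towards spam, which is also why a spammer who avoids such words can lower the score.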
3. Support Vector Machines (SVM)
High Accuracy: SVMs perform well in high-dimensional spaces, improving spam
detection rates.
Better Generalization: SVMs work effectively even with small datasets, compared to
probabilistic models.
Computationally Expensive: Training SVM models can be slow, especially with large email
datasets.
Difficulty in Handling Noisy Data: SVMs struggle when spam and ham messages
overlap.
2.1 Limitations of Existing Systems or Research Gap
1. Adaptability to Evolving Spam Techniques
Challenge: Spammers continuously develop new techniques, such as
obfuscation, adversarial attacks, and evolving phishing strategies, which can bypass
traditional spam filters.
Research Gap: There is a need for adaptive learning models that can detect and
respond to emerging spam patterns without frequent manual updates.
2. High False Positives and False Negatives
Challenge: Many spam filters incorrectly classify legitimate emails as spam (false
positives) or fail to detect spam messages (false negatives), leading to user frustration
and security risks.
Research Gap: Research is needed to improve classification accuracy through
advanced hybrid models that reduce false positives while maintaining high spam
detection rates.
3. Phishing and Malicious Content Detection
Challenge: Many spam classifiers struggle to detect phishing emails and malicious
attachments, as these threats often mimic legitimate communication styles.
Research Gap: There is an opportunity for research into multi-layered detection
approaches that integrate email content analysis, link verification, and sender
reputation checks to improve phishing detection.
4. Scalability and Computational Efficiency
Challenge: Advanced machine learning and deep learning models require significant
computational resources, making real-time spam detection challenging for large-scale
email services.
Research Gap: Efficient, lightweight models that maintain high accuracy while
optimizing processing speed and resource usage need to be developed.
Figure 2: Step Wise Process of Spam Detection
2.2 Mini Project Contribution
This mini project on Email Spam Detection using Machine Learning was a
collaborative effort. Each member contributed to different phases of the project,
ensuring smooth development and successful completion. The following table outlines
the individual contributions:
Team Member: Contribution
Aryan Bhujbal: Collected and cleaned the dataset, implemented the text
preprocessing functions (punctuation removal, stopword filtering), and handled
vectorization using CountVectorizer. Also created the prediction function.
Jatin Mewada: Built and trained the Multinomial Naive Bayes model, performed
model evaluation (accuracy, confusion matrix, classification report), and saved the
model using pickle.
Yash Bhoir: Worked on data visualization using matplotlib and seaborn, created
visual reports (spam frequency, heatmap), and handled dataset exploration and duplicate
removal.
Mohit Borana: Developed the structure for the project documentation, wrote the
abstract, conclusion, future work, and designed the presentation slides.
Chapter 3
Proposed System
3.1 Proposed System
The system processes raw email data by cleaning and converting it into a machine-
readable format. It then uses a Multinomial Naive Bayes classifier — a probabilistic algorithm
well-suited for text classification problems — to learn patterns associated with spam and ham
(non-spam) emails. Once trained, the model can analyze new emails and predict their class
with a high degree of accuracy.
Key Features of the Proposed System
1. Automated Spam Detection
Automatically identifies spam emails using content-based analysis without the need for
manual filtering.
2. Efficient Preprocessing
Includes removal of stopwords, punctuation, and irrelevant content, helping the model focus
on meaningful words.
3. Bag-of-Words Model (CountVectorizer)
Transforms email text into numerical vectors based on word frequency for machine learning
compatibility.
4. Machine Learning Classification
Uses the Multinomial Naive Bayes algorithm, which is fast, efficient, and performs well with
high-dimensional text data.
5. Model Persistence
The trained model and vectorizer are saved using pickle, making the system reusable and
efficient for real-world deployment.
6. User-Friendly Interface (CLI or Web)
The system includes a sample prediction function, and can easily be extended to a
command-line interface or web app for ease of use.
Figure 3: Proposed System of Spam Detection
3.1.1 Project Module
1. Data Collection Module
Objective: To acquire a dataset of email messages labeled as spam or not spam.
Description: This module involves gathering or importing an existing email dataset (e.g.,
emails.csv) that contains two main columns: the email text and the label (1 = spam, 0 = not spam).
Tools/Tech: pandas, Google Drive (for data storage in Colab)
Functionality: Load dataset using pandas.read_csv()
Display basic information, stats, and structure
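The loading and inspection steps of this module might look like the sketch below. It is a minimal example under stated assumptions: the column names "text" and "spam" are not confirmed by the report, and an in-memory CSV stands in for emails.csv so that the snippet runs on its own.

```python
import io
import pandas as pd

# Stand-in for emails.csv so the sketch runs on its own; the column names
# "text" and "spam" are assumptions, not confirmed by the report.
sample_csv = io.StringIO(
    'text,spam\n'
    '"Win a free prize now!!!",1\n'
    '"Meeting moved to Monday at 10am",0\n'
    '"Win a free prize now!!!",1\n'
)

df = pd.read_csv(sample_csv)        # in the project: pd.read_csv("emails.csv")

print(df.head())                    # first rows
print(df.shape)                     # (rows, columns)
df.info()                           # column types and non-null counts
print(df["spam"].value_counts())    # class balance

# Duplicate removal is mentioned in the contribution section of the report.
df = df.drop_duplicates().reset_index(drop=True)
```

Checking the class balance early is useful because a heavily skewed dataset affects how the accuracy score should be read later.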
2. Data Preprocessing Module
Objective: To clean and prepare the email text data for modeling.
Description: Raw email data is full of noise such as punctuation, special characters, and
stopwords. This module processes the text to convert it into a machine-understandable format.
Steps Involved: Remove punctuation and lowercase all words
Eliminate stopwords using NLTK
Tokenize and join the clean words
Tools/Tech: string, nltk, stopwords, custom process() function
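A possible shape for the process() function named above is sketched here. The exact implementation in the project is not shown in the report, so this is an assumption; in particular, a small hard-coded stopword set stands in for NLTK's English list (nltk.corpus.stopwords.words("english")) so that the example runs without downloading any corpus.

```python
import string

# Small stand-in stopword set so the sketch runs offline. The report uses
# NLTK's English list, i.e. nltk.corpus.stopwords.words("english").
STOPWORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "in",
             "you", "your", "for", "this"}

def process(text):
    """Return the cleaned, lowercased tokens of an email."""
    # 1. Drop punctuation characters.
    no_punct = "".join(ch for ch in text if ch not in string.punctuation)
    # 2. Lowercase and split on whitespace.
    tokens = no_punct.lower().split()
    # 3. Remove stopwords.
    return [tok for tok in tokens if tok not in STOPWORDS]

tokens = process("Congratulations! You are the WINNER of a free prize.")
cleaned = " ".join(tokens)   # joined form, if a single string is needed
print(tokens)
```

Returning a list of tokens lets the same function be passed directly to CountVectorizer as a custom analyzer.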
3. Feature Extraction Module
Objective: To transform the cleaned email text into numerical feature vectors.
Description: Text data is converted into a numeric matrix using the CountVectorizer technique
which helps in representing email text as word frequency counts.
Functionality:
Initialize CountVectorizer with custom analyzer
Fit and transform the data into a sparse matrix
Save the vectorizer for future predictions
Tools/Tech: sklearn.feature_extraction.text.CountVectorizer, pickle
4. Model Training Module
Objective: To train a machine learning model using the preprocessed dataset.
Description: A Multinomial Naive Bayes classifier is trained using the feature vectors to classify emails as
spam or not spam.
Steps Involved:
Split dataset into training and testing sets (80/20)
Train the model on training data
Save the trained model
Tools/Tech: sklearn.naive_bayes.MultinomialNB, train_test_split, pickle
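The training steps listed above could be sketched as below. The ten short messages are hypothetical, and the stratify and random_state arguments are assumptions added for reproducibility; the report only states the 80/20 split.

```python
import pickle
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Toy data (hypothetical) standing in for the real dataset.
texts = [
    "win free prize now", "free cash offer click", "claim your free reward",
    "cheap loans win money", "urgent prize claim now",
    "meeting at noon tomorrow", "project report attached", "lunch on friday",
    "please review the schedule", "notes from the team call",
]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]   # 1 = spam, 0 = ham

X = CountVectorizer().fit_transform(texts)

# 80/20 split; stratify and random_state are assumptions, not settings
# taken from the report.
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.20, stratify=labels, random_state=0
)

model = MultinomialNB()
model.fit(X_train, y_train)

# Serialize the trained model. In the project this would go to a file,
# e.g. pickle.dump(model, f) inside "with open(path, 'wb') as f:".
model_bytes = pickle.dumps(model)
```

Stratifying keeps the spam-to-ham ratio the same in both parts of the split, which matters most when the dataset is small or imbalanced.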
5. Model Evaluation Module
Objective: To evaluate the performance of the trained model.
Description: The accuracy, precision, recall, F1-score, and confusion matrix are computed to evaluate
model effectiveness.
Functionality:
Make predictions on test set
Generate accuracy score, classification report
Plot confusion matrix as heatmap
Tools/Tech: sklearn.metrics, seaborn, matplotlib
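The metrics named in this module are available in sklearn.metrics. The sketch below uses hypothetical true labels and predictions, not results from the project, to show how they are computed.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Hypothetical true labels and predictions (1 = spam, 0 = ham).
y_test = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

accuracy = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)   # rows: true class, columns: predicted

print(f"Accuracy: {accuracy:.2f}")
print(cm)
print(classification_report(y_test, y_pred, target_names=["ham", "spam"]))

# The matrix can be drawn as a heatmap, e.g.
# seaborn.heatmap(cm, annot=True, fmt="d")
```

For spam filtering, the top-right cell of the matrix (ham predicted as spam) deserves particular attention, because a blocked legitimate email is usually more costly than a missed spam message.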
6. Prediction Module
Objective: To predict whether new/unseen emails are spam or not.
Description: This module loads the saved model and vectorizer, processes the input email, and makes a
prediction.
Functionality:
Load model and vectorizer using pickle
Vectorize new input
Return prediction label and confidence score
Tools/Tech: pickle, MultinomialNB.predict_proba()
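A prediction function along these lines might look like the sketch below. The name predict_email, the toy training data, and the use of in-memory bytes instead of files are all assumptions made to keep the example self-contained; the default CountVectorizer analyzer is used here in place of the project's custom one.

```python
import pickle
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Train and serialize a tiny model so the sketch is self-contained.
texts = ["win free prize now", "free cash offer",
         "meeting at noon", "project report attached"]
labels = [1, 1, 0, 0]                       # 1 = spam, 0 = ham
vectorizer = CountVectorizer()
model = MultinomialNB().fit(vectorizer.fit_transform(texts), labels)
saved_model = pickle.dumps(model)
saved_vectorizer = pickle.dumps(vectorizer)

def predict_email(text, model_bytes, vectorizer_bytes):
    """Return (label, confidence) for one email (hypothetical helper)."""
    loaded_model = pickle.loads(model_bytes)
    loaded_vectorizer = pickle.loads(vectorizer_bytes)
    # transform, not fit_transform: the vocabulary must stay as trained.
    features = loaded_vectorizer.transform([text])
    probabilities = loaded_model.predict_proba(features)[0]
    best = probabilities.argmax()
    label = "spam" if loaded_model.classes_[best] == 1 else "not spam"
    return label, float(probabilities[best])

label, confidence = predict_email("free prize offer", saved_model, saved_vectorizer)
print(f"{label} ({confidence:.1%})")
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.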
7. User Interface / Integration Module (Optional)
Objective: To build an interface (web or CLI) for users to interact with the system.
Future Scope: Can be implemented using Flask, Streamlit, or a basic Python UI using tkinter.
3.2 Algorithm and Process Design
The following section describes the algorithm and process design details.
3.2.1 Algorithm
1. Collect Data.
2. Preprocess Emails
3. Feature Extraction
4. Train a Machine Learning Model
5. Test and Evaluate the Model
6. Deploy and Classify Emails
7. Update Model Regularly
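One compact way to express these steps is scikit-learn's Pipeline, which chains the vectorizer and the classifier into a single object. This is an alternative packaging shown for illustration, not the structure used in the project; the data is hypothetical, the built-in English stopword list replaces NLTK's, and step 5 (evaluation) is left out for brevity.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Step 1: collect data (hypothetical toy examples; 1 = spam, 0 = ham).
texts = ["win free prize now", "free cash offer",
         "meeting at noon", "project report attached"]
labels = [1, 1, 0, 0]

# Steps 2-4: preprocessing, feature extraction and training in one object.
spam_filter = Pipeline([
    ("vectorizer", CountVectorizer(lowercase=True, stop_words="english")),
    ("classifier", MultinomialNB()),
])
spam_filter.fit(texts, labels)

# Step 6: classify incoming emails.
predictions = spam_filter.predict(["claim your free prize", "report for the meeting"])
print(predictions)

# Step 7: update the model by refitting on the enlarged dataset.
texts = texts + ["claim your reward today"]
labels = labels + [1]
spam_filter.fit(texts, labels)
```

Because the pipeline is a single estimator, it can be saved and reloaded as one object, which avoids the model and vectorizer getting out of sync.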
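3.2.2 Process Design
1. Data Loading: The labeled email dataset is loaded with pandas and duplicate rows are removed.
2. Preprocessing and Feature Extraction: Punctuation and stopwords are removed, and the
cleaned text is converted into word-count vectors with CountVectorizer.
3. Training and Evaluation: The data is split 80/20, a Multinomial Naive Bayes model is
trained, and the accuracy, confusion matrix, and classification report are computed.
4. Prediction: The saved model and vectorizer are loaded to classify new emails and report
a confidence score.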
Figure 4: Process of Spam Detecting During Message Flow
CHAPTER 4
FUTURE WORK
1. Better Text Processing
Use more advanced techniques such as lemmatization or TF-IDF weighting to improve the
quality of the input features.
2. Try Other Models
We can try using other machine learning models like SVM or Random Forest, and even
combine them with Naïve Bayes for better accuracy.
3. Use Deep Learning
Deep learning models like LSTM or BERT can be tested to understand the meaning of emails
more deeply.
4. Handle Imbalanced Data
If there are more ham emails than spam, we can balance the data to improve results.
5. Real-Time Detection
Improve the system so that it can detect spam in real-time when new emails arrive.
6. Support for More Languages
The current model works mainly for English. It can be improved to detect spam in other
languages too.
7. Detect Image or File Spam
Some spam emails have images or attachments. We can work on detecting those as well.
Conclusion
In this project, we successfully implemented a spam detection system using fundamental
machine learning techniques and simple text preprocessing methods. The system effectively
classifies email messages into spam and non-spam categories using a Multinomial Naive
Bayes model, which is known for its efficiency and accuracy in text-based classification
problems. By leveraging basic preprocessing techniques—such as removing punctuation,
filtering out stopwords, and converting text data into numerical format using
CountVectorizer—we created a reliable feature set suitable for model training.
The performance of the model was evaluated using a variety of metrics including accuracy
score, confusion matrix, and classification report, all of which indicated strong predictive
capabilities. Additionally, we ensured the reusability of our model by saving both the trained
model and vectorizer using the pickle library, allowing real-time spam prediction for new
emails. A custom prediction function was also developed to demonstrate practical usage of
the model.
This end-to-end system, from data preprocessing and training to prediction, shows that even
with fundamental tools and techniques it is possible to build a functional and effective spam
detection system. Overall, the project provides a solid foundation for understanding text
classification workflows, model building, evaluation, and deployment, and for seeing how
machine learning can be applied in practical, real-life applications.
Appendix
1) PYTHON
Python is a high-level, interpreted programming language created by Guido van
Rossum and first released in 1991. It emphasizes code readability with its use of
significant indentation. Python supports multiple programming paradigms,
including procedural, object-oriented, and functional programming. It has a vast
standard library, making it suitable for a wide range of applications, from web
development to data science, automation, and artificial intelligence. Python is open-
source, managed by the Python Software Foundation, and has a large, active
community contributing to its ongoing development and support.
1. Data Handling & Processing
pickle – Serializes and deserializes Python objects.
numpy – Provides support for large, multi-dimensional arrays and mathematical
functions.
pandas – Offers data manipulation and analysis tools.
2. Data Visualization
seaborn – Enhances data visualization with attractive statistical graphics.
matplotlib.pyplot – Enables plotting and visualization of data.
3. Text Processing & Machine Learning
string – Provides string manipulation functions.
sklearn.feature_extraction.text.CountVectorizer – Converts text data into numerical
format.
sklearn.model_selection.train_test_split – Splits data into training and testing sets.
sklearn.naive_bayes.MultinomialNB – Implements the Naïve Bayes classifier for
text classification.
sklearn.metrics – Provides performance evaluation metrics like accuracy.
SCHEDULE FOR MINI PROJECT
Columns: Date, Week, Content, Remark, Guide Sign
Week 1: Formation of mini project groups.
Week 2: Submission of 3 titles of mini project from each group.
Week 3: Presentation of each group in front of the panel to finalize the mini project title from 3 different titles.
Week 4: Allocation of a guide to every mini project group as per domain expertise.
Week 5: Literature survey on finalized topics and planning.
Week 6: Preparation of PPTs for the finalized titles.
Week 7: Report to the allocated guide and finalize the PPT with proper detail as suggested by the guide.
Week 8: Second presentation of the mini project with all details in front of the panel.
Week 9: Report to guide.
Week 10: Report to guide.
Week 11: Complete all work and report to guide.
Week 12: Final presentation in front of the external examiner at the time of the university exam.