0% found this document useful (0 votes)

34 views15 pages

PDFF

This document presents a machine learning-based SMS spam detection system using the Multinomial Naive Bayes algorithm, achieving an accuracy of 97.6% on a dataset of over 5,500 labeled messages. The methodology includes text preprocessing techniques such as lowercasing, stopword removal, and TF-IDF vectorization to prepare the data for classification. The findings indicate that the model effectively identifies spam messages while minimizing false positives, suggesting its real-world applicability in mobile communication environments.

Uploaded by

pranav1256kam

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

34 views15 pages

PDFF

Uploaded by

pranav1256kam

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 15

SMS Spam Classification Using Multinomial

Naive Bayes
*
vachas pati pandey ,Pranav Kumar ,Dr. Sukesha
1 2

1
UIET, Punjab University, Chandigarh 160023, India 2

Pran2535kumr@gmail.com
3
UIET, Punjab University, Chandigarh 160023, India 2sukeshauiet@gmail.com

Abstract. With the exponential growth of mobile communication, SMS-based spam has
become a pervasive nuisance, leading to breaches in privacy, financial frauds, and
user frustration. This article explores the implementation of a machine learning-
based SMS spam detection system using the Multinomial Naive Bayes algorithm.
We utilize the SMS Spam Collection dataset, which comprises over 5,500 labeled
messages categorized as "ham" (legitimate) or "spam" (unwanted).
Text preprocessing steps including lowercasing, stopword removal, and
tokenization are applied to clean the dataset. Messages are then vectorized using
Term Frequency-Inverse Document Frequency (TF-IDF), a method that scales
down the importance of common terms and enhances rare but meaningful terms
across the dataset.
The transformed data is used to train a Multinomial Naive Bayes classifier — a
probabilistic model particularly effective for text data due to its assumption of
word occurrence independence. The classifier is trained on 80% of the data and
tested on the remaining 20%. Evaluation metrics include accuracy, precision,
recall, and F1-score. Our trained model achieves an accuracy of 97.6%, with high
precision indicating very few false positives and good recall ensuring most spam
messages are detected.
Visual analysis through a confusion matrix highlights the model's strengths and
errors. The curve formed on the heatmap emphasizes low misclassification rates.
Feature importance analysis shows spam indicators such as "free", "win", and
"urgent" as top contributors.
The findings suggest that with minimal computational resources, the proposed
model offers robust performance and real-world applicability. Potential
enhancements include integrating deep learning models or deploying the system
in mobile or web environments for real-time filtering.

Keywords: Spam detection ;Multimonial Naive bayes; Machine learnig

2
1 Introduction

1.2 Problem Statement

In the digital age, mobile communication via Short Message Service
(SMS) has become an essential medium for information exchange.
However, this widespread adoption has also opened doors to misuse—
particularly through the dissemination of spam messages. These spam
messages, ranging from irrelevant advertisements to phishing attacks and
fraudulent schemes, not only clutter user inboxes but also pose serious
risks to privacy, financial security, and trust in communication platforms.
Traditional spam detection systems are largely rule-based. These systems
operate by identifying specific keywords, sender addresses, or message
patterns that are known indicators of spam. While initially effective, such
systems quickly become outdated. Spammers frequently modify their
tactics by obfuscating keywords, using alternate spellings, or
incorporating benign-looking phrases to bypass filters. This results in a
significant rise in false negatives, where spam messages go undetected.
Conversely, rule-based filters can also produce false positives—flagging
legitimate messages as spam, thereby disrupting important
communication. This dual failure severely limits the practicality of static
filtering systems.
Moreover, SMS data itself presents unique challenges. Unlike formal
documents or structured emails, SMS messages are often short,
ungrammatical, and rife with abbreviations, emojis, slang, and
inconsistent casing. Such unstructured formats make it difficult to apply
conventional linguistic or statistical rules. This variability necessitates
dynamic models that can adapt to linguistic ambiguity and evolving spam
trends.
To overcome these limitations, the integration of machine learning (ML)
techniques becomes essential. ML models, particularly supervised
learning algorithms, can learn from historical message data to detect
patterns that indicate spam. Among them, the Multinomial Naive Bayes
(MNB) classifier stands out for text-based problems. It is computationally
efficient, requires relatively small training data, and performs well even
when features are conditionally independent—making it ideal for SMS
classification tasks.
This project aims to build an intelligent, adaptive spam detection system
using the
Multinomial Naive Bayes algorithm. By vectorizing messages using Term
FrequencyInverse Document Frequency (TF-IDF) techniques, the model
transforms raw text into structured numerical inputs. These are then used
to train the MNB classifier, enabling it to recognize subtle word frequency
distributions and co-occurrence patterns commonly found in spam
messages. The ultimate objective is to create a system that not only
improves accuracy and recall over traditional methods but also adapts
gracefully to new spam tactics in real-time deployment scenarios.
2 Literature Review
The problem of SMS spam detection has garnered significant attention in the
domain of Natural Language Processing (NLP), primarily due to its direct
3
impact on mobile communication security and user experience. Over the years,
researchers have experimented with various machine learning algorithms to
develop models capable of identifying unsolicited messages with high
accuracy.
Among the most widely used algorithms is the Naive Bayes classifier,
particularly the Multinomial variant, which is favored for its effectiveness in
handling highdimensional text data and its computational simplicity. Naive
Bayes assumes conditional independence among features, a simplification
that, while rarely true in real-world text data, works surprisingly well in
practice. Its efficiency and accuracy have made it a standard baseline for spam
classification tasks.

Other machine learning algorithms have also shown promise:

Support Vector Machines (SVM): These are powerful classifiers capable of
achieving high precision, particularly with well-separated classes. They are
known for their robustness and ability to generalize well on unseen data.
However, SVMs are computationally intensive, especially with large datasets
or non-linear kernels, making them less ideal for real-time spam filtering on
mobile devices.
Random Forests: As ensemble models, Random Forests use multiple decision
trees to improve classification accuracy. They handle structured data well and
can model complex interactions between features. Nonetheless, they can be
prone to overfitting, especially in unbalanced datasets, and their
interpretability is lower than simpler models like Naive Bayes.
Neural Networks: With the rise of deep learning, neural networks have been
applied to SMS spam detection, especially using architectures like LSTMs
and CNNs. While they can capture complex patterns and contextual
semantics, they demand large annotated datasets and significant
computational resources. This makes them impractical for lightweight,
realtime applications without powerful backend support.
Research by Almeida et al. (2011), which analyzed SMS spam detection using
various models, has been foundational in this space. Datasets such as the UCI
SMS Spam Collection have become benchmarks for experimentation and
validation of spam classification models. These studies provide a comparative
framework to evaluate new approaches and reinforce the efficacy of
lightweight, yet powerful models like Multinomial Naive Bayes in real-world
applications. identification of specific frequency bands that showed
prominence when cross- referenced with user-defined output data,
followed by the computational training data for the development of the long
short-term memory (LSTM) and Support vector Machine (SVM) as confusion
level classifiers.
4
3 Dataset Description
3.1 Source
The dataset used in this project is the SMS Spam Collection, a well-known corpus for
binary SMS classification tasks. It was obtained from the following public
repository: Dataset URL:

SMS Spam Collection - GitHub

Format: Tab-Separated Values (TSV)

3.2 Overview
This dataset contains a total of 5,572 SMS messages, each labeled either as "ham"
(legitimate) or "spam" (unwanted/unsolicited). It is widely used for benchmarking
text classification models due to its balance of simplicity and real-world relevance.

3.3 Data Sample

Here are a couple of examples illustrating the format:

• ham → “Go until jurong point, crazy..”

• spam → “Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005...”
These messages reflect the informal, varied nature of SMS communication, which
makes preprocessing and feature extraction critical.

3.4 Data Cleaning To prepare the dataset for model training, the following
preprocessing steps were applied: Label Encoding:

Textual labels were converted to binary form — ham → 0, spam → 1.

• Missing Values Check:

A thorough inspection revealed no missing values in the dataset.
• Text Normalization (explained in preprocessing section) includes converting to
lowercase, removing punctuation, stopwords, and tokenization.
3.5 Label Distribution

A key characteristic of this dataset is its class imbalance:

• Ham messages: 4,825 (≈ 86.6%)

• Spam messages: 747 (≈ 13.4%)

This imbalance could lead to biased models that overly favor the majority class.
To mitigate this, techniques like stratified sampling, balanced evaluation metrics
(e.g., precision-recall), and potential resampling methods
(undersampling/oversampling) are considered during model development to
ensure fair and robust performance.
5

3.1 Dataset Preprocessing

Preprocessing is a crucial step in preparing raw SMS data for machine learning. Since text
data is inherently unstructured, it must be converted into a structured, numerical form to be
processed by classification algorithms. This section outlines the key preprocessing
techniques applied to the SMS Spam Collection dataset to ensure consistency, reduce noise,
and extract meaningful features for modeling.

Lowercasing :The first preprocessing step involves converting all text to lowercase. This
normalization ensures that words like "Free" and "free" are treated as the same token.
Without lowercasing, the model might mistakenly consider different cases as separate
features, reducing overall accuracy and increasing dimensionality unnecessarily.

Stop Word Removal :Stop words are commonly used words such as “is”, “the”, “in”, “a”,
and “to” that typically do not carry meaningful information for classification tasks. These
are removed using the built-in stop word list from the scikit-learn TF-IDF vectorizer with
the parameter stop_words='english'. This reduces noise and improves the efficiency of
feature extraction by focusing only on informative words.

Tokenization :Tokenization is the process of breaking down a message into smaller units,
typically words, called tokens. Each token represents a feature that can be analyzed and
vectorized. For instance, the sentence "Win a free ticket now!" becomes the tokens: ["win",
"a", "free", "ticket", "now"].

TF-IDF Vectorization :After tokenization and cleaning, the next step is to transform the text
data into numerical form using TF-IDF (Term Frequency-Inverse Document Frequency).
This technique evaluates the importance of a word in a specific message relative to the
entire corpus. • Term Frequency (TF): Measures how often a word appears in a single
message.

Inverse Document Frequency (IDF): Measures how rare a word is across all messages.
Words that appear in many messages have lower IDF scores.
TF-IDF Score: Computed as TF × IDF, resulting in higher values for terms that are frequent
in a message but rare across the dataset. This method not only helps to identify key patterns
unique to spam or ham messages but also scales well for large datasets. The resulting TFIDF
matrix is a sparse, highdimensional numerical representation that serves as input to the
machine learning model.
Convert to lowercase → ensures consistency Remove stop words → eliminates irrelevant
noise Tokenize messages → breaks text into analyzable units Apply TF-IDF vectorization
→ converts text to meaningful numerical features

4 Methodoloy
The methodology outlines the approach used to build and evaluate the SMS spam
classification model. It includes selecting a suitable algorithm, preparing the
dataset for training and testing, and applying vectorization techniques to convert
text into numerical features.
6
Algorithm: Multinomial Naive Bayes The core of this spam detection system is the
Multinomial Naive Bayes classifier—a probabilistic algorithm based on Bayes'
theorem. This algorithm is particularly wellsuited for text classification problems,
where the features are the frequency of words or tokens.

The classifier estimates the probability that a message belongs to a class (ham
or spam) given the presence of certain words. The formula used is:
P(Class∣Text)∝P(Class)×∏P(Wordi∣Class)P(\text{Class} |
\text{Text}) \propto

P(\text{Class}) \times \prod P(\text{Word}_i |

\text{Class})P(Class∣Text)∝P(Class)×∏P(
Wordi∣Class) Here:
P(Class)P(\text{Class})P(Class) is the prior probability of a class (spam or ham),
P(Wordi∣Class)P(\text{Word}_i | \text{Class})P(Wordi∣Class) is the likelihood of
a word appearing in a message of that class.
The “naive” assumption here is that all words in the message are conditionally
independent given the class label—an oversimplification, but one that yields good
results in practice for text classification tasks like spam detection.
This algorithm is computationally efficient, requires minimal training data, and
has shown strong performance in detecting spam, making it ideal for this project.

4.2 Train-Test Split

To evaluate the model's effectiveness, the dataset is divided into training

and testing sets:

• 80% Training Data – used to train the model.

• 20% Testing Data – used to evaluate model performance.

The split is stratified, meaning that the proportion of spam and ham messages is
preserved in both sets. This ensures that the model is trained and testeon balanced
samples, reducing potential bias and improving generalizability.

Example:
Ham messages (86.6%) and spam messages (13.4%) retain this ratio in both
training and test sets.
Vectorization
Machine learning algorithms require numerical input, so the raw text messages are
transformed using TF-IDF Vectorization. This technique captures the relevance of
a word in a message relative to all messages in the dataset.

The vectorization process is implemented using: stop_words='english': Removes

common English words that add little semantic value. max_features=3000: Limits
the vocabulary to the top 3,000 most informative words, reducing computational
cost and overfitting.
This step produces a sparse matrix where each row represents a message and
each column corresponds to a unique word feature, scored by its TF-IDF value.

Summary of Methodology:
7
Use of Multinomial Naive Bayes for efficient and effective spam classification.

Stratified Train-Test Split to ensure fair model evaluation.

TF-IDF Vectorization to convert unstructured text into meaningful numerical
features.
This methodology provides a solid foundation for building an accurate, scalable,
and data-driven SMS spam detection model.

Model Training and Evaluation

To assess the effectiveness of our spam detection system, we trained and evaluated a
machine learning model using the Multinomial Naive Bayes algorithm. This
probabilistic classifier is particularly effective for text classification tasks where
features are discrete, such as word counts or frequencies.
Training the Model
We split the dataset into training and testing sets using an 80-20 stratified split to
preserve class balance. The text data was vectorized using the TFIDF method, which
converts messages into numerical feature vectors while downweighting common terms
and emphasizing unique ones. With the features prepared, we trained the Multinomial
Naive Bayes model on the training set: model = MultinomialNB() model.fit(X_train,
y_train)
This step involved learning the likelihood of each word occurring in spam and ham
messages, based on the training data.
Model Performance
The model was then tested on the remaining 20% of the dataset. It achieved an accuracy
of 97.6%, indicating that the vast majority of predictions were correct. A detailed
classification report further highlighted the model’s strong performance:
• Ham (legitimate messages): Precision = 0.99, Recall = 0.99, F1-score =
0.99

• Spam messages: Precision = 0.92, Recall = 0.92, F1-score = 0.92

These metrics show that the model is highly accurate in detecting spam while
minimizing false positives and negatives.
Confusion Matrix
The confusion matrix summarizes the results:
Actual \ Predicted Ham (0) Spam (1)
Ham (0) 956 9

Actual \ Predicted Ham (0) Spam (1)

Spam (1) 12 138

Out of 1,115 test messages, only 21 were misclassified. This low error rate
demonstrates the model’s reliability and its potential for real-world deployment.
Conclusion
The Multinomial Naive Bayes model, combined with TF-IDF vectorization, provides a
fast, scalable, and highly accurate solution for SMS spam detection. Its ability to
generalize across varied message formats makes it ideal for practical use in
messaging platforms and mobile networks.
8

.
9
5 Results and analysis
The spam detection model demonstrated strong performance metrics:
• Precision (Spam): 92% – Out of all messages predicted as spam, 92% were
indeed spam.

• Recall (Spam): 92% – The model successfully identified 92% of all actual spam
messages.

• High accuracy in 'ham' messages was expected due to the class imbalance
(majority class).

• Most misclassifications involved messages that blend promotional and casual

tones, reflecting the model's difficulty with ambiguous content.
10

6. Sample Predictions
After training the spam detection model using the Multinomial Naive Bayes algorithm
and TF-IDF vectorization, we created a simple Python function that can predict whether a
given SMS message is SPAM or NOT SPAM. This makes it easier to test the model in
real-world scenarios and validate its performance on messages outside the training set.
def test_email(text):
vector = tfidf.transform([text])
prediction = model.predict(vector)[0]
return "SPAM" if prediction == 1 else "NOT SPAM"
Step-by-Step Working:
Input: o The function accepts a single SMS message in plain text format.
o Example: "You’ve won a free prize! Click to claim now." Vectorization using
TF-IDF: o The message is converted into a numerical format (a sparse
vector) using tfidf.transform([text]).
o TF-IDF (Term Frequency–Inverse Document Frequency) gives importance to
rare but meaningful words like “free”, “win”, “urgent”, etc., which are often
present in spam messages.
Prediction: o The vector is passed to the trained Naive Bayes model for prediction:
model.predict(vector)[0].
o The model returns either 1 (for spam) or 0 (for not spam).
Final Output:
o Based on the model’s prediction, the function returns "SPAM" or "NOT
SPAM".

🧪 Real-World Prediction Examples:

✅ Example 1:
• Input:
"Congratulations! You have won a free iPhone. Click here to claim now."
• Prediction: SPAM
• Reason:
The message contains common spam indicators such as “Congratulations”,
“won”, “free”, and “Click here”, which are highly weighted in the TF-IDF
model. These words are frequently associated with promotional scams or
phishing messages.

✅ Example 2:
• Input:
"Hi team, please find attached the meeting agenda for tomorrow."
• Prediction:
11
NOT SPAM
• Reason:
This message is formal and professional, with no suspicious or promotional
terms. It resembles typical work or business communication, so the model
correctly classifies it as a legitimate (ham) message.
Importance of This Function:
• This function acts as a real-time predictor for SMS messages, helping end-users
or systems detect spam instantly.
• It can be easily integrated into web applications, mobile apps, or chat systems to
filter out spam content.
• It showcases how machine learning can be used in practical, user-facing solutions
with minimal effort once the model is trained.
12

6. Discussion – Strengths, Weaknesses & Alternatives

6.1 Strengths of the Model 1. Simplicity and Interpretability
The Multinomial Naive Bayes algorithm is based on probability and is very easy
to understand and interpret. It doesn’t require complex tuning or infrastructure,
which makes it perfect for beginners and quick deployments.
2. High Speed and Efficiency
This model is extremely fast during both training and prediction, making it
suitable for real-time SMS filtering applications or environments with limited
computational power like mobile devices. 3. Strong Accuracy and
Precision
The spam detection model achieved an impressive accuracy of around 97.6%. Its
precision for spam detection is 92%, which means it rarely marks a legitimate
message as spam, making it reliable and user-friendly.

6.2 Weaknesses of the Model

4. Independence Assumption (Bag-of-Words Limitation)
The model treats every word as independent from the others, which is not always
true in natural language. This can result in missing the actual meaning or context
of a message.
5. Confusion in Ambiguous Messages
It struggles with SMS content that blends informal and promotional language.
For example, "Hey! You have a great offer waiting" could be spam or just a
message from a friend, which confuses the model.
6. Class Imbalance Bias
Since the dataset contains more “ham” messages than “spam”, the model tends to
favor predicting non-spam. This causes it to occasionally miss actual spam
messages, affecting recall.

6.3 Alternatives and Enhancements

7. Using N-grams for Context Understanding
Including bigrams and trigrams (like “win prize”, “free entry now”) helps the
model understand context better, which improves accuracy on tricky messages. 8.
Deep Learning Models (LSTM, BERT)
Advanced models like LSTM and BERT can learn the order and meaning of
words in a sentence. They perform better on messages with complex or hidden
spam indicators but require more computation. 9. Data Augmentation and
Balancing
13
Oversampling the minority spam class using techniques like SMOTE can help
reduce bias and improve model performance on rare but important spam
patterns.

7. Conclusion – Key Takeaways & Future Scope

The SMS spam detection system built using Multinomial Naive Bayes and TF-IDF shows
how a simple approach can still be powerful. It offers:

• Ease of implementation
• Fast and efficient results
• High accuracy with minimal resources
However, as spam messages become smarter and more subtle, this basic model might not
be enough. Future work should aim to:

• Include n-grams and syntactic features for better context.

• Switch to deep learning approaches for more semantic understanding.
• Use language detection for multi-lingual spam filtering.
• Expand dataset to handle modern threats like phishing or OTP fraud.
• Deploy the system in mobile or web apps for real-time SMS protection.
14

8. Acknowledgements
We would like to acknowledge the following contributions and resources that were
invaluable in completing this project:
1. Scikit-learn Documentation: For providing comprehensive and easy-to-follow
documentation on the Multinomial Naive Bayes algorithm and TF-IDF vectorizer,
which were essential for building the spam detection model.
2. UCI Machine Learning Repository: For providing the SMS Spam Collection dataset,
which served as the foundation for training and testing our model.
3. Python Libraries:
o NumPy and Pandas: For data manipulation and preprocessing.
o Matplotlib and Seaborn: For visualizations that helped analyze the
dataset.
o Scikit-learn: For implementing the machine learning algorithms and
vectorization techniques.
4. Deep Learning Community: For the inspiration to explore more complex models
(such as LSTM and BERT) that could enhance the spam detection system.
5. Support from Mentors and Colleagues: For their valuable feedback and
suggestions that helped in refining the approach and model.

9. References
1. Scikit-learn Documentation:
o Scikit-learn: Machine Learning in Python. Available at: https://scikit-
learn.org/stable/
2. SMS Spam Collection Dataset:
o Almeida, T. A., & Silva, A. (2011). SMS Spam Collection Dataset. UCI
Machine Learning Repository. Available at:
https://archive.ics.uci.edu/ml/datasets/sms+spam+collection
3. TF-IDF Vectorization:
o Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to
Information Retrieval. Cambridge University Press.
4. Multinomial Naive Bayes Algorithm:
o Rish, I. (2001). An Empirical Study of the Naive Bayes Classifier. IJCAI-01
Workshop on Empirical Methods in Artificial Intelligence, 41-46.
5. N-grams for Text Classification:
15

o Zhou, M., & Sukhbaatar, S. (2017). Exploring the use of n-grams for
document classification. In Proceedings of the 2017 International
Conference on Machine Learning.
6. Deep Learning Models for NLP (LSTM, BERT):
o Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. A.,
Kaiser, Ł., & Polosukhin, I. (2017). Attention is All You Need. In Advances
in Neural Information Processing Systems, 30.
7. SMOTE and Data Augmentation:
o Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002).
SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial
Intelligence Research, 16, 321-357.

Abh 1
No ratings yet
Abh 1
17 pages
Icdici 274 Spam Sms
No ratings yet
Icdici 274 Spam Sms
6 pages
Sms Spam Detcetion Review Paper
No ratings yet
Sms Spam Detcetion Review Paper
4 pages
SMS Spam Detection Using Machine Learning: An Experimental Study
No ratings yet
SMS Spam Detection Using Machine Learning: An Experimental Study
7 pages
Investigating Evasive Techniques in Sms Spam Filtering A Comparative Analysis of Machine Learning Models Ijariie26436
No ratings yet
Investigating Evasive Techniques in Sms Spam Filtering A Comparative Analysis of Machine Learning Models Ijariie26436
10 pages
Intern 2
No ratings yet
Intern 2
26 pages
Sms Spam Detection Project Final
No ratings yet
Sms Spam Detection Project Final
59 pages
Spam SMS (Or) Email Detection and Classification Using Machine Learning
No ratings yet
Spam SMS (Or) Email Detection and Classification Using Machine Learning
5 pages
B 14 Sms Spam Detection ML Ieee Report
No ratings yet
B 14 Sms Spam Detection ML Ieee Report
5 pages
Fedspam: Privacy Preserving Sms Spam Prediction
No ratings yet
Fedspam: Privacy Preserving Sms Spam Prediction
12 pages
Batch 6
No ratings yet
Batch 6
6 pages
Ijsse 14.01 28
No ratings yet
Ijsse 14.01 28
8 pages
Aiml Pro
No ratings yet
Aiml Pro
14 pages
Major Project by Ali (Intrainz)
No ratings yet
Major Project by Ali (Intrainz)
25 pages
Email Spam
No ratings yet
Email Spam
8 pages
Spam Detection with Python
No ratings yet
Spam Detection with Python
26 pages
Department of Cse (Artificial Intelligence & Data Science) : Sms Spam Detection
No ratings yet
Department of Cse (Artificial Intelligence & Data Science) : Sms Spam Detection
27 pages
Solution: March 2018
No ratings yet
Solution: March 2018
8 pages
ML Spam Detection for Developers
No ratings yet
ML Spam Detection for Developers
51 pages
Sms Spam Term Paper
No ratings yet
Sms Spam Term Paper
10 pages
A Hybrid Machine Learning Approach For Spam and Malware
No ratings yet
A Hybrid Machine Learning Approach For Spam and Malware
14 pages
SMS Spam Detection Methods
No ratings yet
SMS Spam Detection Methods
14 pages
Development of Content-Based SMS Classification Application by Using Word2Vec-based Feature Extraction
No ratings yet
Development of Content-Based SMS Classification Application by Using Word2Vec-based Feature Extraction
10 pages
Anchalora
No ratings yet
Anchalora
29 pages
Nisha Internship3
No ratings yet
Nisha Internship3
87 pages
SMS Spam Filtering for Academics
No ratings yet
SMS Spam Filtering for Academics
6 pages
Spam Detection System 1
No ratings yet
Spam Detection System 1
21 pages
Fjet 12 11 4
No ratings yet
Fjet 12 11 4
13 pages
Spam Detection Thesis
100% (3)
Spam Detection Thesis
6 pages
IJNRD2403165
No ratings yet
IJNRD2403165
5 pages
SMS Spam Detection with NLP
No ratings yet
SMS Spam Detection with NLP
21 pages
Final PPT
No ratings yet
Final PPT
18 pages
Content-Based Sms Spam Filtering Using Machine Learning Technique
No ratings yet
Content-Based Sms Spam Filtering Using Machine Learning Technique
7 pages
SMS Spam Detection with ML Algorithms
No ratings yet
SMS Spam Detection with ML Algorithms
4 pages
Sms Spaming Detection Using NLP Techniques
No ratings yet
Sms Spaming Detection Using NLP Techniques
9 pages
Future Generation Computer Systems: Pradeep Kumar Roy Jyoti Prakash Singh Snehasish Banerjee
No ratings yet
Future Generation Computer Systems: Pradeep Kumar Roy Jyoti Prakash Singh Snehasish Banerjee
10 pages
(KAVYA R SHETTY)
No ratings yet
(KAVYA R SHETTY)
21 pages
Spam SMS Filtering Based On Text Features and Supervised Machine Learning Techniques
No ratings yet
Spam SMS Filtering Based On Text Features and Supervised Machine Learning Techniques
19 pages
SMS Spam Detection Using H2O Framework, SMS Spam Detection Using H2O Framework
No ratings yet
SMS Spam Detection Using H2O Framework, SMS Spam Detection Using H2O Framework
8 pages
Message Spam Identification by Naive Bayes Classifier Algorithm Using Machine Learning
No ratings yet
Message Spam Identification by Naive Bayes Classifier Algorithm Using Machine Learning
5 pages
Opll
No ratings yet
Opll
20 pages
SMS Spam Detection for Developers
No ratings yet
SMS Spam Detection for Developers
9 pages
V24i0527 1714999068
No ratings yet
V24i0527 1714999068
7 pages
Spam Detection Synopsis
No ratings yet
Spam Detection Synopsis
8 pages
Spam Detection in Text Using Machine Learning 1
No ratings yet
Spam Detection in Text Using Machine Learning 1
85 pages
Report
No ratings yet
Report
19 pages
Sms Spam Filtering System Hybrid Approaches
No ratings yet
Sms Spam Filtering System Hybrid Approaches
25 pages
A Spam Transformer Model For SMS Spam Detection
No ratings yet
A Spam Transformer Model For SMS Spam Detection
11 pages
RTRP Batch 10
No ratings yet
RTRP Batch 10
20 pages
Synopsis Email Spam
No ratings yet
Synopsis Email Spam
9 pages
Black Yellow Modern Minimalist Elegant Presentation
No ratings yet
Black Yellow Modern Minimalist Elegant Presentation
29 pages
Project Report Template AICTE Internship 2025
No ratings yet
Project Report Template AICTE Internship 2025
20 pages
Research Paper Spam Detection
No ratings yet
Research Paper Spam Detection
4 pages
Khairanmarzuki, Journal Manager, MatrikJournal
No ratings yet
Khairanmarzuki, Journal Manager, MatrikJournal
14 pages
International Journal of Research Publication and Reviews: Spam Call Detection Using Machine Learning
No ratings yet
International Journal of Research Publication and Reviews: Spam Call Detection Using Machine Learning
9 pages
Applied Sciences: A Discrete Hidden Markov Model For SMS Spam Detection
No ratings yet
Applied Sciences: A Discrete Hidden Markov Model For SMS Spam Detection
17 pages
Format Termpaper
No ratings yet
Format Termpaper
9 pages
Sumit Kumar
No ratings yet
Sumit Kumar
58 pages
ML 2 Marks Quick Revision
No ratings yet
ML 2 Marks Quick Revision
3 pages
Check Balance and Imbalance Using Stack
No ratings yet
Check Balance and Imbalance Using Stack
2 pages
ML File 17 March
No ratings yet
ML File 17 March
18 pages
ML 01 (Shubham)
No ratings yet
ML 01 (Shubham)
14 pages
ML 01 (Pranavv)
No ratings yet
ML 01 (Pranavv)
14 pages
Pranavsql
No ratings yet
Pranavsql
26 pages
A.C. Joshi Library Panjab University, Chandigarh
No ratings yet
A.C. Joshi Library Panjab University, Chandigarh
1 page
F 2 L
No ratings yet
F 2 L
6 pages
Mat212 STA211 December 2012 Test 2a
No ratings yet
Mat212 STA211 December 2012 Test 2a
2 pages
Chain Ladder Excel Caritat
No ratings yet
Chain Ladder Excel Caritat
86 pages
Final Year Project Report
No ratings yet
Final Year Project Report
44 pages
Motion Planning Algorithms Guide
No ratings yet
Motion Planning Algorithms Guide
43 pages
Cross-Validation in Machine Learning
No ratings yet
Cross-Validation in Machine Learning
18 pages
Spe 19556 Pa
No ratings yet
Spe 19556 Pa
7 pages
FRL's & Service Road Levels
No ratings yet
FRL's & Service Road Levels
16 pages
Error Checking in Networks
No ratings yet
Error Checking in Networks
42 pages
Simon Speck PDF
No ratings yet
Simon Speck PDF
45 pages
Module 5 Concept Space - Version Space - Candidate Elimination
No ratings yet
Module 5 Concept Space - Version Space - Candidate Elimination
25 pages
Pole Zero Plot
No ratings yet
Pole Zero Plot
3 pages
MPC LAB Manual New
No ratings yet
MPC LAB Manual New
24 pages
Bangla Hand Written Digit Recognition
No ratings yet
Bangla Hand Written Digit Recognition
19 pages
(IJCST-V13I2P2) :seema Saroj, Sakshi Sahu, Sanjana Patel, Suraj Sahu
No ratings yet
(IJCST-V13I2P2) :seema Saroj, Sakshi Sahu, Sanjana Patel, Suraj Sahu
2 pages
AI Assignment 6 - Employee Performance Analysis - Jupyter Notebook
No ratings yet
AI Assignment 6 - Employee Performance Analysis - Jupyter Notebook
9 pages
A Solution Manual To A Practical Introduction To Python Programming by Brian Heinold
30% (27)
A Solution Manual To A Practical Introduction To Python Programming by Brian Heinold
5 pages
Digital Communications Exam Quiz
No ratings yet
Digital Communications Exam Quiz
1 page
Traffic Assignment
No ratings yet
Traffic Assignment
12 pages
Fall 2017 MAT350 Course Outline
No ratings yet
Fall 2017 MAT350 Course Outline
2 pages
Solution 1
63% (8)
Solution 1
3 pages
Matrices and Calculus Test Papers
No ratings yet
Matrices and Calculus Test Papers
2 pages
Ai-Augmented Security Models For Software Development: Leveraging Machine Learning For Threat Detection and Mitigation
No ratings yet
Ai-Augmented Security Models For Software Development: Leveraging Machine Learning For Threat Detection and Mitigation
11 pages
DSP Lab Guide for ECE Students
No ratings yet
DSP Lab Guide for ECE Students
54 pages
Kenny-230720-8 Unique Machine Learning Interview Questions About K Nearest Neighbors
No ratings yet
Kenny-230720-8 Unique Machine Learning Interview Questions About K Nearest Neighbors
3 pages
Probability & Statistics Homework 3
No ratings yet
Probability & Statistics Homework 3
13 pages
SIE 431 Simulation Modeling and Analysis Midterm Exam, May 15 2021 60 Minutes For Exam Name
No ratings yet
SIE 431 Simulation Modeling and Analysis Midterm Exam, May 15 2021 60 Minutes For Exam Name
9 pages
DAA 2marks With Answers
No ratings yet
DAA 2marks With Answers
11 pages
The Maximum Flow Problem
No ratings yet
The Maximum Flow Problem
64 pages
Binomial Distribution Lecture
No ratings yet
Binomial Distribution Lecture
15 pages

PDFF

Uploaded by

PDFF

Uploaded by

SMS Spam Classification Using Multinomial

Keywords: Spam detection ;Multimonial Naive bayes; Machine learnig

1.2 Problem Statement

Other machine learning algorithms have also shown promise:

SMS Spam Collection - GitHub

3.3 Data Sample

• ham → “Go until jurong point, crazy..”

Textual labels were converted to binary form — ham → 0, spam → 1.

• Missing Values Check:

A key characteristic of this dataset is its class imbalance:

• Ham messages: 4,825 (≈ 86.6%)

• Spam messages: 747 (≈ 13.4%)

3.1 Dataset Preprocessing

P(\text{Class}) \times \prod P(\text{Word}_i |

4.2 Train-Test Split

To evaluate the model's effectiveness, the dataset is divided into training

• 80% Training Data – used to train the model.

• 20% Testing Data – used to evaluate model performance.

The vectorization process is implemented using: stop_words='english': Removes

Stratified Train-Test Split to ensure fair model evaluation.

Model Training and Evaluation

• Spam messages: Precision = 0.92, Recall = 0.92, F1-score = 0.92

Actual \ Predicted Ham (0) Spam (1)

Spam (1) 12 138

• Most misclassifications involved messages that blend promotional and casual

🧪 Real-World Prediction Examples:

6. Discussion – Strengths, Weaknesses & Alternatives

6.2 Weaknesses of the Model

6.3 Alternatives and Enhancements

7. Conclusion – Key Takeaways & Future Scope

• Include n-grams and syntactic features for better context.

You might also like