0% found this document useful (0 votes)
34 views15 pages

PDFF

This document presents a machine learning-based SMS spam detection system using the Multinomial Naive Bayes algorithm, achieving an accuracy of 97.6% on a dataset of over 5,500 labeled messages. The methodology includes text preprocessing techniques such as lowercasing, stopword removal, and TF-IDF vectorization to prepare the data for classification. The findings indicate that the model effectively identifies spam messages while minimizing false positives, suggesting its real-world applicability in mobile communication environments.

Uploaded by

pranav1256kam
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
34 views15 pages

PDFF

This document presents a machine learning-based SMS spam detection system using the Multinomial Naive Bayes algorithm, achieving an accuracy of 97.6% on a dataset of over 5,500 labeled messages. The methodology includes text preprocessing techniques such as lowercasing, stopword removal, and TF-IDF vectorization to prepare the data for classification. The findings indicate that the model effectively identifies spam messages while minimizing false positives, suggesting its real-world applicability in mobile communication environments.

Uploaded by

pranav1256kam
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 15

SMS Spam Classification Using Multinomial

Naive Bayes
*
vachas pati pandey ,Pranav Kumar ,Dr. Sukesha
1 2

1
UIET, Punjab University, Chandigarh 160023, India 2

Pran2535kumr@gmail.com
3
UIET, Punjab University, Chandigarh 160023, India 2sukeshauiet@gmail.com

Abstract. With the exponential growth of mobile communication, SMS-based spam has
become a pervasive nuisance, leading to breaches in privacy, financial frauds, and
user frustration. This article explores the implementation of a machine learning-
based SMS spam detection system using the Multinomial Naive Bayes algorithm.
We utilize the SMS Spam Collection dataset, which comprises over 5,500 labeled
messages categorized as "ham" (legitimate) or "spam" (unwanted).
Text preprocessing steps including lowercasing, stopword removal, and
tokenization are applied to clean the dataset. Messages are then vectorized using
Term Frequency-Inverse Document Frequency (TF-IDF), a method that scales
down the importance of common terms and enhances rare but meaningful terms
across the dataset.
The transformed data is used to train a Multinomial Naive Bayes classifier — a
probabilistic model particularly effective for text data due to its assumption of
word occurrence independence. The classifier is trained on 80% of the data and
tested on the remaining 20%. Evaluation metrics include accuracy, precision,
recall, and F1-score. Our trained model achieves an accuracy of 97.6%, with high
precision indicating very few false positives and good recall ensuring most spam
messages are detected.
Visual analysis through a confusion matrix highlights the model's strengths and
errors. The curve formed on the heatmap emphasizes low misclassification rates.
Feature importance analysis shows spam indicators such as "free", "win", and
"urgent" as top contributors.
The findings suggest that with minimal computational resources, the proposed
model offers robust performance and real-world applicability. Potential
enhancements include integrating deep learning models or deploying the system
in mobile or web environments for real-time filtering.

Keywords: Spam detection ;Multimonial Naive bayes; Machine learnig


2
1 Introduction

1.2 Problem Statement


In the digital age, mobile communication via Short Message Service
(SMS) has become an essential medium for information exchange.
However, this widespread adoption has also opened doors to misuse—
particularly through the dissemination of spam messages. These spam
messages, ranging from irrelevant advertisements to phishing attacks and
fraudulent schemes, not only clutter user inboxes but also pose serious
risks to privacy, financial security, and trust in communication platforms.
Traditional spam detection systems are largely rule-based. These systems
operate by identifying specific keywords, sender addresses, or message
patterns that are known indicators of spam. While initially effective, such
systems quickly become outdated. Spammers frequently modify their
tactics by obfuscating keywords, using alternate spellings, or
incorporating benign-looking phrases to bypass filters. This results in a
significant rise in false negatives, where spam messages go undetected.
Conversely, rule-based filters can also produce false positives—flagging
legitimate messages as spam, thereby disrupting important
communication. This dual failure severely limits the practicality of static
filtering systems.
Moreover, SMS data itself presents unique challenges. Unlike formal
documents or structured emails, SMS messages are often short,
ungrammatical, and rife with abbreviations, emojis, slang, and
inconsistent casing. Such unstructured formats make it difficult to apply
conventional linguistic or statistical rules. This variability necessitates
dynamic models that can adapt to linguistic ambiguity and evolving spam
trends.
To overcome these limitations, the integration of machine learning (ML)
techniques becomes essential. ML models, particularly supervised
learning algorithms, can learn from historical message data to detect
patterns that indicate spam. Among them, the Multinomial Naive Bayes
(MNB) classifier stands out for text-based problems. It is computationally
efficient, requires relatively small training data, and performs well even
when features are conditionally independent—making it ideal for SMS
classification tasks.
This project aims to build an intelligent, adaptive spam detection system
using the
Multinomial Naive Bayes algorithm. By vectorizing messages using Term
FrequencyInverse Document Frequency (TF-IDF) techniques, the model
transforms raw text into structured numerical inputs. These are then used
to train the MNB classifier, enabling it to recognize subtle word frequency
distributions and co-occurrence patterns commonly found in spam
messages. The ultimate objective is to create a system that not only
improves accuracy and recall over traditional methods but also adapts
gracefully to new spam tactics in real-time deployment scenarios.
2 Literature Review
The problem of SMS spam detection has garnered significant attention in the
domain of Natural Language Processing (NLP), primarily due to its direct
3
impact on mobile communication security and user experience. Over the years,
researchers have experimented with various machine learning algorithms to
develop models capable of identifying unsolicited messages with high
accuracy.
Among the most widely used algorithms is the Naive Bayes classifier,
particularly the Multinomial variant, which is favored for its effectiveness in
handling highdimensional text data and its computational simplicity. Naive
Bayes assumes conditional independence among features, a simplification
that, while rarely true in real-world text data, works surprisingly well in
practice. Its efficiency and accuracy have made it a standard baseline for spam
classification tasks.

Other machine learning algorithms have also shown promise:


Support Vector Machines (SVM): These are powerful classifiers capable of
achieving high precision, particularly with well-separated classes. They are
known for their robustness and ability to generalize well on unseen data.
However, SVMs are computationally intensive, especially with large datasets
or non-linear kernels, making them less ideal for real-time spam filtering on
mobile devices.
Random Forests: As ensemble models, Random Forests use multiple decision
trees to improve classification accuracy. They handle structured data well and
can model complex interactions between features. Nonetheless, they can be
prone to overfitting, especially in unbalanced datasets, and their
interpretability is lower than simpler models like Naive Bayes.
Neural Networks: With the rise of deep learning, neural networks have been
applied to SMS spam detection, especially using architectures like LSTMs
and CNNs. While they can capture complex patterns and contextual
semantics, they demand large annotated datasets and significant
computational resources. This makes them impractical for lightweight,
realtime applications without powerful backend support.
Research by Almeida et al. (2011), which analyzed SMS spam detection using
various models, has been foundational in this space. Datasets such as the UCI
SMS Spam Collection have become benchmarks for experimentation and
validation of spam classification models. These studies provide a comparative
framework to evaluate new approaches and reinforce the efficacy of
lightweight, yet powerful models like Multinomial Naive Bayes in real-world
applications. identification of specific frequency bands that showed
prominence when cross- referenced with user-defined output data,
followed by the computational training data for the development of the long
short-term memory (LSTM) and Support vector Machine (SVM) as confusion
level classifiers.
4
3 Dataset Description
3.1 Source
The dataset used in this project is the SMS Spam Collection, a well-known corpus for
binary SMS classification tasks. It was obtained from the following public
repository: Dataset URL:

SMS Spam Collection - GitHub


Format: Tab-Separated Values (TSV)

3.2 Overview
This dataset contains a total of 5,572 SMS messages, each labeled either as "ham"
(legitimate) or "spam" (unwanted/unsolicited). It is widely used for benchmarking
text classification models due to its balance of simplicity and real-world relevance.

3.3 Data Sample


Here are a couple of examples illustrating the format:

• ham → “Go until jurong point, crazy..”

• spam → “Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005...”
These messages reflect the informal, varied nature of SMS communication, which
makes preprocessing and feature extraction critical.

3.4 Data Cleaning To prepare the dataset for model training, the following
preprocessing steps were applied: Label Encoding:

Textual labels were converted to binary form — ham → 0, spam → 1.

• Missing Values Check:


A thorough inspection revealed no missing values in the dataset.
• Text Normalization (explained in preprocessing section) includes converting to
lowercase, removing punctuation, stopwords, and tokenization.
3.5 Label Distribution

A key characteristic of this dataset is its class imbalance:

• Ham messages: 4,825 (≈ 86.6%)

• Spam messages: 747 (≈ 13.4%)


This imbalance could lead to biased models that overly favor the majority class.
To mitigate this, techniques like stratified sampling, balanced evaluation metrics
(e.g., precision-recall), and potential resampling methods
(undersampling/oversampling) are considered during model development to
ensure fair and robust performance.
5

3.1 Dataset Preprocessing

Preprocessing is a crucial step in preparing raw SMS data for machine learning. Since text
data is inherently unstructured, it must be converted into a structured, numerical form to be
processed by classification algorithms. This section outlines the key preprocessing
techniques applied to the SMS Spam Collection dataset to ensure consistency, reduce noise,
and extract meaningful features for modeling.

Lowercasing :The first preprocessing step involves converting all text to lowercase. This
normalization ensures that words like "Free" and "free" are treated as the same token.
Without lowercasing, the model might mistakenly consider different cases as separate
features, reducing overall accuracy and increasing dimensionality unnecessarily.

Stop Word Removal :Stop words are commonly used words such as “is”, “the”, “in”, “a”,
and “to” that typically do not carry meaningful information for classification tasks. These
are removed using the built-in stop word list from the scikit-learn TF-IDF vectorizer with
the parameter stop_words='english'. This reduces noise and improves the efficiency of
feature extraction by focusing only on informative words.

Tokenization :Tokenization is the process of breaking down a message into smaller units,
typically words, called tokens. Each token represents a feature that can be analyzed and
vectorized. For instance, the sentence "Win a free ticket now!" becomes the tokens: ["win",
"a", "free", "ticket", "now"].

TF-IDF Vectorization :After tokenization and cleaning, the next step is to transform the text
data into numerical form using TF-IDF (Term Frequency-Inverse Document Frequency).
This technique evaluates the importance of a word in a specific message relative to the
entire corpus. • Term Frequency (TF): Measures how often a word appears in a single
message.

Inverse Document Frequency (IDF): Measures how rare a word is across all messages.
Words that appear in many messages have lower IDF scores.
TF-IDF Score: Computed as TF × IDF, resulting in higher values for terms that are frequent
in a message but rare across the dataset. This method not only helps to identify key patterns
unique to spam or ham messages but also scales well for large datasets. The resulting TFIDF
matrix is a sparse, highdimensional numerical representation that serves as input to the
machine learning model.
Convert to lowercase → ensures consistency Remove stop words → eliminates irrelevant
noise Tokenize messages → breaks text into analyzable units Apply TF-IDF vectorization
→ converts text to meaningful numerical features

4 Methodoloy
The methodology outlines the approach used to build and evaluate the SMS spam
classification model. It includes selecting a suitable algorithm, preparing the
dataset for training and testing, and applying vectorization techniques to convert
text into numerical features.
6
Algorithm: Multinomial Naive Bayes The core of this spam detection system is the
Multinomial Naive Bayes classifier—a probabilistic algorithm based on Bayes'
theorem. This algorithm is particularly wellsuited for text classification problems,
where the features are the frequency of words or tokens.

The classifier estimates the probability that a message belongs to a class (ham
or spam) given the presence of certain words. The formula used is:
P(Class∣Text)∝P(Class)×∏P(Wordi∣Class)P(\text{Class} |
\text{Text}) \propto

P(\text{Class}) \times \prod P(\text{Word}_i |


\text{Class})P(Class∣Text)∝P(Class)×∏P(
Wordi∣Class) Here:
P(Class)P(\text{Class})P(Class) is the prior probability of a class (spam or ham),
P(Wordi∣Class)P(\text{Word}_i | \text{Class})P(Wordi∣Class) is the likelihood of
a word appearing in a message of that class.
The “naive” assumption here is that all words in the message are conditionally
independent given the class label—an oversimplification, but one that yields good
results in practice for text classification tasks like spam detection.
This algorithm is computationally efficient, requires minimal training data, and
has shown strong performance in detecting spam, making it ideal for this project.

4.2 Train-Test Split

To evaluate the model's effectiveness, the dataset is divided into training


and testing sets:

• 80% Training Data – used to train the model.

• 20% Testing Data – used to evaluate model performance.


The split is stratified, meaning that the proportion of spam and ham messages is
preserved in both sets. This ensures that the model is trained and testeon balanced
samples, reducing potential bias and improving generalizability.

Example:
Ham messages (86.6%) and spam messages (13.4%) retain this ratio in both
training and test sets.
Vectorization
Machine learning algorithms require numerical input, so the raw text messages are
transformed using TF-IDF Vectorization. This technique captures the relevance of
a word in a message relative to all messages in the dataset.

The vectorization process is implemented using: stop_words='english': Removes


common English words that add little semantic value. max_features=3000: Limits
the vocabulary to the top 3,000 most informative words, reducing computational
cost and overfitting.
This step produces a sparse matrix where each row represents a message and
each column corresponds to a unique word feature, scored by its TF-IDF value.

Summary of Methodology:
7
Use of Multinomial Naive Bayes for efficient and effective spam classification.

Stratified Train-Test Split to ensure fair model evaluation.


TF-IDF Vectorization to convert unstructured text into meaningful numerical
features.
This methodology provides a solid foundation for building an accurate, scalable,
and data-driven SMS spam detection model.

Model Training and Evaluation


To assess the effectiveness of our spam detection system, we trained and evaluated a
machine learning model using the Multinomial Naive Bayes algorithm. This
probabilistic classifier is particularly effective for text classification tasks where
features are discrete, such as word counts or frequencies.
Training the Model
We split the dataset into training and testing sets using an 80-20 stratified split to
preserve class balance. The text data was vectorized using the TFIDF method, which
converts messages into numerical feature vectors while downweighting common terms
and emphasizing unique ones. With the features prepared, we trained the Multinomial
Naive Bayes model on the training set: model = MultinomialNB() model.fit(X_train,
y_train)
This step involved learning the likelihood of each word occurring in spam and ham
messages, based on the training data.
Model Performance
The model was then tested on the remaining 20% of the dataset. It achieved an accuracy
of 97.6%, indicating that the vast majority of predictions were correct. A detailed
classification report further highlighted the model’s strong performance:
• Ham (legitimate messages): Precision = 0.99, Recall = 0.99, F1-score =
0.99

• Spam messages: Precision = 0.92, Recall = 0.92, F1-score = 0.92


These metrics show that the model is highly accurate in detecting spam while
minimizing false positives and negatives.
Confusion Matrix
The confusion matrix summarizes the results:
Actual \ Predicted Ham (0) Spam (1)
Ham (0) 956 9

Actual \ Predicted Ham (0) Spam (1)

Spam (1) 12 138


Out of 1,115 test messages, only 21 were misclassified. This low error rate
demonstrates the model’s reliability and its potential for real-world deployment.
Conclusion
The Multinomial Naive Bayes model, combined with TF-IDF vectorization, provides a
fast, scalable, and highly accurate solution for SMS spam detection. Its ability to
generalize across varied message formats makes it ideal for practical use in
messaging platforms and mobile networks.
8

.
9
5 Results and analysis
The spam detection model demonstrated strong performance metrics:
• Precision (Spam): 92% – Out of all messages predicted as spam, 92% were
indeed spam.

• Recall (Spam): 92% – The model successfully identified 92% of all actual spam
messages.

• High accuracy in 'ham' messages was expected due to the class imbalance
(majority class).

• Most misclassifications involved messages that blend promotional and casual


tones, reflecting the model's difficulty with ambiguous content.
10

6. Sample Predictions
After training the spam detection model using the Multinomial Naive Bayes algorithm
and TF-IDF vectorization, we created a simple Python function that can predict whether a
given SMS message is SPAM or NOT SPAM. This makes it easier to test the model in
real-world scenarios and validate its performance on messages outside the training set.
def test_email(text):
vector = tfidf.transform([text])
prediction = model.predict(vector)[0]
return "SPAM" if prediction == 1 else "NOT SPAM"
Step-by-Step Working:
Input: o The function accepts a single SMS message in plain text format.
o Example: "You’ve won a free prize! Click to claim now." Vectorization using
TF-IDF: o The message is converted into a numerical format (a sparse
vector) using tfidf.transform([text]).
o TF-IDF (Term Frequency–Inverse Document Frequency) gives importance to
rare but meaningful words like “free”, “win”, “urgent”, etc., which are often
present in spam messages.
Prediction: o The vector is passed to the trained Naive Bayes model for prediction:
model.predict(vector)[0].
o The model returns either 1 (for spam) or 0 (for not spam).
Final Output:
o Based on the model’s prediction, the function returns "SPAM" or "NOT
SPAM".

🧪 Real-World Prediction Examples:


✅ Example 1:
• Input:
"Congratulations! You have won a free iPhone. Click here to claim now."
• Prediction: SPAM
• Reason:
The message contains common spam indicators such as “Congratulations”,
“won”, “free”, and “Click here”, which are highly weighted in the TF-IDF
model. These words are frequently associated with promotional scams or
phishing messages.

✅ Example 2:
• Input:
"Hi team, please find attached the meeting agenda for tomorrow."
• Prediction:
11
NOT SPAM
• Reason:
This message is formal and professional, with no suspicious or promotional
terms. It resembles typical work or business communication, so the model
correctly classifies it as a legitimate (ham) message.
Importance of This Function:
• This function acts as a real-time predictor for SMS messages, helping end-users
or systems detect spam instantly.
• It can be easily integrated into web applications, mobile apps, or chat systems to
filter out spam content.
• It showcases how machine learning can be used in practical, user-facing solutions
with minimal effort once the model is trained.
12

6. Discussion – Strengths, Weaknesses & Alternatives


6.1 Strengths of the Model 1. Simplicity and Interpretability
The Multinomial Naive Bayes algorithm is based on probability and is very easy
to understand and interpret. It doesn’t require complex tuning or infrastructure,
which makes it perfect for beginners and quick deployments.
2. High Speed and Efficiency
This model is extremely fast during both training and prediction, making it
suitable for real-time SMS filtering applications or environments with limited
computational power like mobile devices. 3. Strong Accuracy and
Precision
The spam detection model achieved an impressive accuracy of around 97.6%. Its
precision for spam detection is 92%, which means it rarely marks a legitimate
message as spam, making it reliable and user-friendly.

6.2 Weaknesses of the Model


4. Independence Assumption (Bag-of-Words Limitation)
The model treats every word as independent from the others, which is not always
true in natural language. This can result in missing the actual meaning or context
of a message.
5. Confusion in Ambiguous Messages
It struggles with SMS content that blends informal and promotional language.
For example, "Hey! You have a great offer waiting" could be spam or just a
message from a friend, which confuses the model.
6. Class Imbalance Bias
Since the dataset contains more “ham” messages than “spam”, the model tends to
favor predicting non-spam. This causes it to occasionally miss actual spam
messages, affecting recall.

6.3 Alternatives and Enhancements


7. Using N-grams for Context Understanding
Including bigrams and trigrams (like “win prize”, “free entry now”) helps the
model understand context better, which improves accuracy on tricky messages. 8.
Deep Learning Models (LSTM, BERT)
Advanced models like LSTM and BERT can learn the order and meaning of
words in a sentence. They perform better on messages with complex or hidden
spam indicators but require more computation. 9. Data Augmentation and
Balancing
13
Oversampling the minority spam class using techniques like SMOTE can help
reduce bias and improve model performance on rare but important spam
patterns.

7. Conclusion – Key Takeaways & Future Scope


The SMS spam detection system built using Multinomial Naive Bayes and TF-IDF shows
how a simple approach can still be powerful. It offers:

• Ease of implementation
• Fast and efficient results
• High accuracy with minimal resources
However, as spam messages become smarter and more subtle, this basic model might not
be enough. Future work should aim to:

• Include n-grams and syntactic features for better context.


• Switch to deep learning approaches for more semantic understanding.
• Use language detection for multi-lingual spam filtering.
• Expand dataset to handle modern threats like phishing or OTP fraud.
• Deploy the system in mobile or web apps for real-time SMS protection.
14

8. Acknowledgements
We would like to acknowledge the following contributions and resources that were
invaluable in completing this project:
1. Scikit-learn Documentation: For providing comprehensive and easy-to-follow
documentation on the Multinomial Naive Bayes algorithm and TF-IDF vectorizer,
which were essential for building the spam detection model.
2. UCI Machine Learning Repository: For providing the SMS Spam Collection dataset,
which served as the foundation for training and testing our model.
3. Python Libraries:
o NumPy and Pandas: For data manipulation and preprocessing.
o Matplotlib and Seaborn: For visualizations that helped analyze the
dataset.
o Scikit-learn: For implementing the machine learning algorithms and
vectorization techniques.
4. Deep Learning Community: For the inspiration to explore more complex models
(such as LSTM and BERT) that could enhance the spam detection system.
5. Support from Mentors and Colleagues: For their valuable feedback and
suggestions that helped in refining the approach and model.

9. References
1. Scikit-learn Documentation:
o Scikit-learn: Machine Learning in Python. Available at: https://scikit-
learn.org/stable/
2. SMS Spam Collection Dataset:
o Almeida, T. A., & Silva, A. (2011). SMS Spam Collection Dataset. UCI
Machine Learning Repository. Available at:
https://archive.ics.uci.edu/ml/datasets/sms+spam+collection
3. TF-IDF Vectorization:
o Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to
Information Retrieval. Cambridge University Press.
4. Multinomial Naive Bayes Algorithm:
o Rish, I. (2001). An Empirical Study of the Naive Bayes Classifier. IJCAI-01
Workshop on Empirical Methods in Artificial Intelligence, 41-46.
5. N-grams for Text Classification:
15

o Zhou, M., & Sukhbaatar, S. (2017). Exploring the use of n-grams for
document classification. In Proceedings of the 2017 International
Conference on Machine Learning.
6. Deep Learning Models for NLP (LSTM, BERT):
o Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. A.,
Kaiser, Ł., & Polosukhin, I. (2017). Attention is All You Need. In Advances
in Neural Information Processing Systems, 30.
7. SMOTE and Data Augmentation:
o Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002).
SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial
Intelligence Research, 16, 321-357.

You might also like