SMS Spam Filtering using Supervised Machine Learning Algorithms

Pavas Navaney, Student, Amity University, Noida, Uttar Pradesh, Pavasnavaney@gmail.com
Gaurav Dubey, Assistant Professor, Amity University, Noida, Uttar Pradesh, gdubey@amity.edu
Ajay Rana, Professor, Amity University, Noida, Uttar Pradesh, ajay_rana@amity.edu
Abstract— This paper presents the detection of spam and ham messages using various supervised machine learning algorithms, namely the naïve Bayes algorithm, the support vector machine (SVM) algorithm, and the maximum entropy algorithm, and compares their performance in filtering ham and spam messages. As people engage more in web-based activities, and with rising sharing of private data by companies, SMS spam has become very common. SMS spam filtering inherits much of its functionality from e-mail spam filtering. Comparing the performance of the various supervised learning algorithms, we find that the support vector machine algorithm gives the most accurate results.

I. INTRODUCTION

In the growing era of the Internet, individuals are engaging increasingly in free online services. Individuals tend to share their data on different sites, and that data is often passed on to other organizations that spam individuals to offer their services.

SMS spamming [2][10] is extremely frustrating for users: many critical and valuable messages can get lost among spam messages, and spam messages are additionally used to trap individuals or bait them into purchasing services. As worldwide use of cell phones has grown, another avenue for junk mail has opened up for unscrupulous advertisers. These advertisers use text messages (SMS) to target probable purchasers with unwanted advertising known as SMS spam. This sort of spam is especially bothersome since, unlike email spam, many mobile phone users pay a fee for each SMS received.

Building a classification algorithm [1][11] that filters SMS spam would provide a helpful tool for mobile phone providers. Since naïve Bayes has been used effectively for email spam detection [9], it seems reasonable that it could likewise be used to build an SMS spam classifier [7]. Compared with email spam [6][8], SMS spam presents extra difficulties for automated filters. SMS texts are typically restricted to 160 characters, reducing the amount of content that can be used to decide whether a message is ham or spam. People have also started regularly using shorthand notations and slang, which makes it even more difficult to distinguish between ham and spam. We will test how well a simple naïve Bayes classifier [4] manages these difficulties.

We additionally build models to classify messages using the SVM algorithm and the maximum entropy algorithm [3], and we find that SVM gives the most precise results, with accuracy up to 98%, followed by the naïve Bayes algorithm and then the maximum entropy algorithm.

Spam messages can be characterized as redundant messages sent to a large number of people at once. The rise of spam messages is driven by the following factors: 1) the availability of cheap bulk SMS plans; 2) dependability (since the message reaches the cell phone user directly); 3) the low likelihood of responses from unaware recipients; 4) the ability to customize the message; and 5) free services.

II. BACKGROUND STUDY

To construct the naïve Bayes classifier [4], we use the data collected from the SMS Spam Collection, which is openly available and consists of about 5574 records [5]. This dataset contains the text of SMS messages along with a label signifying whether the message is ham or spam. Junk messages are labeled as spam, while legitimate messages are labeled as ham. A few examples of ham (Table 1) and spam (Table 2) messages are illustrated below:

1. HAM MESSAGES

Draft a reasonable one. And I will see if something can happen.

Okay I can try, but cannot commit.

I am good too. Yes weekdays are busy, all thanks to office.

Table 1: Ham messages

As observed, these are the everyday messages that individuals exchange with each other; they are not junk messages, and the user ought to receive them without the spam filter screening them out.
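For readers who want to reproduce the setup, the SMS Spam Collection stores one record per line: a label ("ham" or "spam") and the raw message text, tab-separated. The following is a minimal Python sketch of reading such records (the paper's own experiments use R; the inline three-message sample below stands in for the real file and is an assumption for illustration):

```python
import csv
import io

# A tiny stand-in for the SMS Spam Collection: each record is a label
# ("ham" or "spam") and the raw message text, tab-separated. The real
# dataset holds about 5574 such records.
sample = io.StringIO(
    "ham\tOkay I can try, but cannot commit.\n"
    "ham\tI am good too. Yes weekdays are busy, all thanks to office.\n"
    "spam\tWant chocolate? Get a whole-some Chocolate Shake free on orders above Rs. 2000.\n"
)

records = [(label, text) for label, text in csv.reader(sample, delimiter="\t")]

# Count how many messages carry each label.
counts = {}
for label, _ in records:
    counts[label] = counts.get(label, 0) + 1

print(counts)  # {'ham': 2, 'spam': 1}
```

The same two-column shape (type and message) is what the classifier pipeline below works from.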
978-1-5386-1719-9/18/$31.00 © 2018 IEEE
2. SPAM MESSAGES

Post Diwali offer! Get 30% off + Free Cloudbar with select LED. Buy with your pre-approved loan.

Hi, good credit score makes you eligible for top loans & credit cards. Get your score in 3 minutes.

Want chocolate? Get a whole-some Chocolate Shake free on orders above Rs. 2000.

Table 2: Spam messages

Looking at the preceding sample messages, we see some distinguishing qualities, or repeated patterns, in the spam messages. One notable observation is that two of the three spam messages use the word "free", yet the same word does not appear in any of the ham messages. On the other hand, two of the ham messages refer to particular days of the week, compared with none of the junk messages.

Our classifiers will exploit such patterns in word frequency to decide whether an SMS message better fits the profile of spam or ham. While it is not inconceivable that "free" would appear outside of a spam SMS, a ham message is likely to provide extra words supplying context. For example, a ham message may state "are you free on Saturday?", while a spam message may use the expression "free melodies and ringtones." The classifier will compute the likelihood of spam and ham given the evidence provided by every word in the message.

We have a total of 5574 records, of which 4827 messages are ham and 747 messages are spam (Chart 1).

Chart 1: Ham v/s Spam

III. ARCHITECTURE OF THE CLASSIFIER

Flow-Diagram 1: Architecture of Spam Filter

As we have the information in raw form in an Excel file, we first import the data. We have two columns named "type" and "message". The message is the text message, while the type is the class of the message, which is either ham or spam.

SMS messages are strings of text made up of words, punctuation, numbers, and breaks. Handling this kind of complex data takes a lot of attention and effort. We need to consider how to remove punctuation and numbers, how to handle uninteresting words such as "and", "or", and "but" (which are called stop words), and how to break sentences apart into individual words. Thankfully, this functionality is provided by members of the R community in a text mining package titled "tm".

The initial phase in processing text data involves creating a corpus, which refers to a collection of text documents. For our purposes, a text document refers to a single SMS message.

After removing the stop words, punctuation, numbers and blank spaces (Figure 1) we are ready to split the text messages
44 2018 8th International Conference on Cloud Computing, Data Science & Engineering (Confluence)
into individual terms, stored in a data structure called a sparse matrix.

The data was then prepared by dividing the dataset into training and testing datasets, with 75% of the messages used for training and 25% used for testing. The training dataset consists of 4171 records and the testing dataset consists of 1403 records.
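The cleaning steps above (lower-casing, removing numbers and punctuation, dropping stop words, tokenizing, then splitting 75/25) can be sketched as follows. This is a minimal Python illustration, not the paper's actual R/tm pipeline; the tiny stop-word list and sample messages are assumptions for demonstration only:

```python
import re
import string

STOP_WORDS = {"and", "or", "but", "a", "the", "is", "to", "on"}  # tiny illustrative list

def clean_tokens(message):
    """Lower-case, drop numbers and punctuation, remove stop words,
    and split the message into individual word tokens."""
    message = message.lower()
    message = re.sub(r"[0-9]", "", message)                           # remove numbers
    message = message.translate(str.maketrans("", "", string.punctuation))
    return [w for w in message.split() if w not in STOP_WORDS]

messages = [
    ("ham",  "Are you free on Saturday?"),
    ("spam", "Free melodies and ringtones! Reply 12345 to claim."),
    ("ham",  "Okay I can try, but cannot commit."),
    ("spam", "Get a free Chocolate Shake on orders above Rs. 2000."),
]

corpus = [(label, clean_tokens(text)) for label, text in messages]
print(corpus[1])  # ('spam', ['free', 'melodies', 'ringtones', 'reply', 'claim'])

# A 75/25 train/test split, as used in the paper (4171 vs. 1403 records there).
cut = int(len(corpus) * 0.75)
train, test = corpus[:cut], corpus[cut:]
```

In the real pipeline the split would be randomized; the slice here keeps the sketch deterministic.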
Figure 1: Cleaning, before v/s after

Now that the data is processed to our liking, the last step is to divide the messages into individual components through a procedure called tokenization. A token is a single element of a text string; in this case, the tokens are words.

The tokens are then represented in the form of the sparse matrix, in which each cell contains a number indicating the count of a word appearing in a particular message. The sparse matrix holds the words in its columns, while the text messages are stored in its rows. The following snapshot displays a small part of the DocumentTermMatrix; the actual table contains 5574 rows and 7958 columns (Fig. 2).

Figure 2: Document Term Matrix

As we can see, many of the cells in the table above are filled with "No", which means that none of those words occur in the initial ten messages of the corpus. This observation is the main reason this data structure is called a sparse matrix: the majority of the cells of the matrix are filled with "No". Although each message contains a few words, the likelihood of any particular word appearing in a given message is small. The entry "yes" in the sparse matrix shows that the words available, bugis, cine, crazy, got and great are present in the first text message.

IV. VISUALIZATION USING WORDCLOUDS

A wordcloud is a way to visually depict the frequency at which words appear in data. The cloud is made up of words scattered somewhat randomly around the figure. Words appearing more often in the text are shown in a larger font, while less common terms appear in smaller fonts. This type of figure has grown in popularity recently, since it provides a way to observe trending activity on social networking sites.

We compare the wordclouds of the ham and spam messages and examine the difference between the frequently occurring terms in the two datasets.

Figure 3: Wordcloud for Spam

As we observe, the most frequently occurring terms in the spam messages are call, free, text, reply, claim, etc. These are the words that we generally encounter in spam messages.

Contrasting the spam wordcloud (Fig. 3) with the ham wordcloud (Fig. 4) gives us an idea of the keywords that will be used by our classifiers in separating ham from spam. If words present in the spam cloud also show up frequently in the ham cloud, our classifier will not have strong keywords for comparison, while if the results are distinct, the models will be able to separate ham from spam well.
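A wordcloud is driven by exactly the per-class word-frequency table described above. The sketch below (Python with illustrative token lists; the paper's figures were produced with R's wordcloud facilities) computes the frequencies such a figure would draw, and the spam-only keywords the classifiers can lean on:

```python
from collections import Counter

# Tokenized messages by class (the cleaning from Section III is assumed
# to have been done already; these token lists are illustrative).
spam_tokens = ["call", "free", "text", "reply", "claim", "free", "call"]
ham_tokens  = ["okay", "try", "commit", "good", "weekdays", "busy", "office"]

spam_freq = Counter(spam_tokens)
ham_freq  = Counter(ham_tokens)

# Words a wordcloud would draw largest for the spam class:
print(spam_freq.most_common(2))  # [('call', 2), ('free', 2)]

# Keywords that appear in spam but not in ham are the strong
# discriminators the classifiers rely on.
strong_keywords = set(spam_freq) - set(ham_freq)
print(sorted(strong_keywords))
```

If the two sets overlapped heavily, `strong_keywords` would shrink, which is exactly the weak-classifier scenario described above.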
Figure 4: Wordcloud for Ham

As we observe, the most frequently occurring terms in the ham wordcloud are completely different from those in the spam wordcloud. This difference suggests that our classifiers will have strong keywords to differentiate between ham and spam.

V. NAÏVE BAYES CLASSIFIER

We can characterize the problem as shown in the following formula, which captures the probability that a message is spam:

P(spam | W1 ∩ ¬W2 ∩ W3) = P(W1 ∩ ¬W2 ∩ W3 | spam) P(spam) / P(W1 ∩ ¬W2 ∩ W3)   (1)

Suppose that there are three words in total in the corpus, and that in a sentence the words W1 and W3 appear but W2 does not. To find the probability of spam, the naïve Bayes algorithm takes the probability of word W1 occurring in spam sentences: the total occurrences of W1 in spam sentences divided by the total occurrences of W1 (spam + ham).

Similarly, we can calculate the probability of ham, which is given by the formula:

P(ham | W1 ∩ ¬W2 ∩ W3) = P(W1 ∩ ¬W2 ∩ W3 | ham) P(ham) / P(W1 ∩ ¬W2 ∩ W3)   (2)

For numerous reasons this equation (Eq. 2) is computationally very hard to solve. As more features are added, a large amount of memory is required to store the probabilities for all of the possible intersections. A large amount of training data would also be needed to make sure that sufficient information exists to cover all possible associations.

Our task becomes less tedious and more memory efficient if we take advantage of the fact that the naïve Bayes algorithm assumes independence between the events. Naïve Bayes assumes class-conditional independence, which means that the events are independent of each other as long as they are conditioned on the same class value. Taking this fact into consideration allows us to simplify the above formula using the probability rule for independent events (Eq. 3):

P(A ∩ B) = P(A) · P(B)   (3)

This results in a much simpler-to-compute equation, demonstrated below:

P(spam | W1 ∩ ¬W2) = P(W1 | spam) P(¬W2 | spam) P(spam) / (P(W1) P(¬W2))   (4)

Similarly, the equation for a ham message is given by:

P(ham | W1 ∩ ¬W2) = P(W1 | ham) P(¬W2 | ham) P(ham) / (P(W1) P(¬W2))   (5)

In general:

P(CL | F1, ..., Fn) = (1/Z) p(CL) ∏(i=1..n) p(Fi | CL)   (6)

Training the naïve Bayes model and comparing its performance on the test dataset, we obtain the following cross table (Table 3):

                Predicted Ham     Predicted Spam    Total
Actual Ham      1205 (98.2%)        22 (1.7%)       1227
Actual Spam       16 (9.0%)        160 (90.9%)       176
Total           1221 (87.0%)       182 (13.0%)      1403

Table 3: Cross table for the naïve Bayes classifier

Therefore we can see that naïve Bayes is 98.2% accurate in classifying ham messages and 90.9% accurate in classifying spam messages, giving an overall accuracy of 94.55%.
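The per-class scoring of Eq. (6) can be instantiated directly on a toy corpus. The sketch below is a minimal naïve Bayes in Python, not the paper's implementation: it adds Laplace (add-one) smoothing, which the paper does not discuss, and drops the shared evidence term 1/Z since it does not affect the spam-vs-ham comparison:

```python
from collections import Counter

# Toy labelled corpus of (class, tokens); the counts below instantiate
# Eq. (6): P(class | words) is proportional to p(class) * prod_i p(word_i | class).
train = [
    ("spam", ["free", "ringtones"]),
    ("spam", ["free", "claim", "prize"]),
    ("ham",  ["free", "on", "saturday"]),
    ("ham",  ["busy", "weekdays"]),
]

priors = Counter(label for label, _ in train)          # class frequencies
word_counts = {"spam": Counter(), "ham": Counter()}    # per-class word tallies
for label, tokens in train:
    word_counts[label].update(tokens)
vocab = len({w for c in word_counts.values() for w in c})

def score(tokens, label):
    """Unnormalized naive Bayes score with add-one (Laplace) smoothing."""
    total = sum(word_counts[label].values())
    p = priors[label] / len(train)                     # p(class)
    for w in tokens:
        p *= (word_counts[label][w] + 1) / (total + vocab)  # p(word | class)
    return p

msg = ["free", "claim"]
spam_score, ham_score = score(msg, "spam"), score(msg, "ham")
print("spam" if spam_score > ham_score else "ham")  # prints "spam"
```

Here "claim" has been seen only in spam, so even though "free" occurs in both classes, the product in Eq. (6) tilts the decision toward spam.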
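The two remaining classifiers, compared in the next two sections, can be sketched with scikit-learn (an assumption; the paper's experiments were done in R). LinearSVC is a linear-kernel SVM, and logistic regression is the standard realization of a maximum entropy classifier; the toy texts below are illustrative:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

# Toy training texts; the document-term matrix mirrors Section III.
train_texts = [
    "free ringtones claim prize now",      # spam
    "free entry win cash claim",           # spam
    "are you free on saturday",            # ham
    "okay i can try but cannot commit",    # ham
]
train_labels = ["spam", "spam", "ham", "ham"]

vec = CountVectorizer()
X = vec.fit_transform(train_texts)         # sparse document-term matrix

preds = {}
for name, model in [("SVM", LinearSVC()), ("MaxEnt", LogisticRegression())]:
    model.fit(X, train_labels)
    preds[name] = model.predict(vec.transform(["claim your free prize"]))[0]

print(preds)
```

On real data the two models differ mainly in how they weight borderline messages, which is what the cross tables below quantify.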
VI. SVM CLASSIFIER

SVMs use a linear boundary called a hyperplane to partition data into groups of similar elements, typically according to the class values. We train the model using the SVM algorithm and draw the cross table to compare its performance.

                Predicted Ham     Predicted Spam    Total
Actual Ham      1215 (98.4%)        20 (1.6%)       1235
Actual Spam        6 (3.6%)        162 (96.4%)       168
Total           1221 (87.0%)       182 (13.0%)      1403

Table 4: Cross table for the SVM classifier

As we observe in the cross table, our SVM model performs better than the naïve Bayes model: it classifies ham with an accuracy of 98.4% and spam with an accuracy of 96.4%, giving an overall accuracy of 97.4% (Table 4).

VII. MAXIMUM ENTROPY CLASSIFIER

The principle behind maximum entropy is that the correct distribution is the one that maximizes the entropy, or uncertainty, while still meeting the constraints set by the 'evidence'. The mathematical formula for entropy is given by:

H(p) = −Σ p(a, b) log p(a, b)   (7)

So the most likely probability distribution p is the one that maximizes the entropy:

p = arg max H(p)   (8)

We train the model using the maximum entropy classifier and draw the cross table to compare its performance.

                Predicted Ham     Predicted Spam    Total
Actual Ham      1195 (98.0%)        24 (2.0%)       1219
Actual Spam       26 (14.1%)       158 (85.9%)       184
Total           1221 (87.0%)       182 (13.0%)      1403

Table 5: Cross table for the maximum entropy classifier

As we observe, the maximum entropy algorithm gives the least accuracy in classifying the messages: it classifies ham messages with 98% accuracy and spam messages with 85.9% accuracy. The overall accuracy of the maximum entropy method is 91.95% (Table 5).

VIII. CONCLUSION

As observed using the cross tables, the SVM algorithm gives the highest accuracy in classifying ham and spam messages, followed by the naïve Bayes method and then the maximum entropy method. The accuracy comparison is illustrated in the bar graph below.

Figure 5: Comparison of Accuracy

Therefore we can safely conclude that building an SMS spam classifier using the SVM algorithm gives us the best possible results, with an accuracy of 97.4% (Fig. 5).

IX. REFERENCES

[1] Michael Crawford, Taghi M. Khoshgoftaar, Joseph D. Prusa, Aaron N. Richter and Hamzah Al Najada, "Survey of review spam detection using machine learning techniques", Journal of Big Data, 2015.

[2] R. Deepa Lakshmi and N. Radha, "Spam classification using supervised learning techniques", A2CWiC '10: Proceedings of the 1st Amrita ACM-W Celebration of Women in Computing in India, Article No. 66.

[3] Anju Radhakrishnan et al., "Email classification using machine learning algorithms", International Journal of Engineering and Technology (IJET).

[4] Dea Delvia Arifin, Shaufiah and Moch. Arif Bijaksana, "Enhancing spam detection on mobile phone Short Message Service (SMS) performance using FP-Growth and naïve Bayes classifier", 2016 IEEE Asia Pacific Conference on Wireless and Mobile (APWiMob), 2016.
[5] J. M. Gómez Hidalgo, T. A. Almeida and A. Yamakami, "On the validity of a new SMS spam collection", Proceedings of the 11th IEEE International Conference on Machine Learning and Applications, 2012.

[6] H. Kaur, "Survey on e-mail spam detection using supervised approach with feature selection", International Journal of Engineering Sciences and Research Technology.

[7] Rekha and S. Negi, "A review on different spam detection approaches", International Journal of Engineering Trends and Technology (IJETT), Vol. 11, No. 6, 2014.

[8] A. S. Aski and N. K. Sourati, "Proposed efficient algorithm to filter spam using machine learning techniques", Pacific Science Review A: Natural Science and Engineering, Elsevier, Vol. 18, No. 2, pp. 145–149, 2016.

[9] S. P. Teli and S. K. Biradar, "Effective email classification for spam and non-spam", International Journal of Advanced Research in Computer and Software Engineering, Vol. 4, 2014.

[10] Shafi'i Muhammad Abdulhamid, "A review on mobile SMS spam filtering techniques", IEEE Access, 2017.

[11] Naresh Kumar Nagwani and Aakanksha Sharaff, "SMS spam filtering and thread identification using bi-level text classification and clustering techniques", Journal of Information Science, 2017.