Procedia Computer Science 52 (2015) 244 – 251
6th International Conference on Ambient Systems, Networks and Technologies, ANT 2015
Dendritic Cell Algorithm for Mobile Phone Spam Filtering
Ali A. Al-Hasan a, El-Sayed M. El-Alfy b,∗
a Saudi Aramco, Dhahran, Saudi Arabia
b College of Computer Sciences and Engineering, King Fahd University of Petroleum and Minerals, Dhahran 31261, Saudi Arabia
Abstract
With the revolution of mobile devices and their applications, significant improvements have been witnessed over the years to support new features in addition to normal phone communication, including web browsing, social networking and entertainment, mobile payment, medical and personal records, e-learning, and rich connectivity to multiple networks. As mobile devices continue to
evolve, the volume of hacking activities targeting them also increases drastically. Receiving short message spam is one of the
common vectors for security breaches. Besides wasting resources and being annoying to end-users, it can be used for phishing
attacks and as a vehicle for other malware types such as worms, backdoors, and key loggers. The next generation of mobile
technologies has more emphasis on security-related issues to protect confidentiality, integrity and availability. This paper explores
a number of content-based feature sets to enhance the mobile phone text messaging services in filtering unwanted messages (a.k.a.
spam). Moreover, it develops a more effective spam filtering model using a combination of the most relevant features and by fusing the decisions of two machine learning algorithms with the Dendritic Cell Algorithm (DCA). The performance has been evaluated
empirically on two SMS spam datasets. The results showed that significant improvements can be achieved in the overall accuracy,
recall and precision of spam and legitimate messages due to the application of the proposed DCA-based model.
© 2015 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
Peer-review under responsibility of the Conference Program Chairs.
Keywords: Mobile Technology; Smartphones; Short Message Service (SMS); Dendritic Cell Algorithm (DCA); Spam Detection and Filtering;
Application Security.
1. Introduction
Nowadays, with the advances in mobile technology, end users are accessing their emails, surfing the world-wide
web, making video & voice calls, using text chatting, gaming and more through their smartphones. The number
of mobile users is increasing significantly over time with almost seven billion cellular subscriptions worldwide 1 .
Mobile devices are now likely to contain personal and confidential information such as credit card numbers, contact
lists, emails, medical records and other sensitive documents. Unlike those for desktop systems, effective security controls to protect mobile devices are not yet mature and remain an active area of research. This can be attributed to the limited resources and processing power of these devices, and to the lack of knowledge and awareness of many end users regarding protection mechanisms. These reasons and more make mobile devices very attractive targets for cyber attacks. Hackers can utilize compromised
∗ Corresponding author. (On leave from Tanta University, College of Engineering, Egypt).
E-mail address: alfy@kfupm.edu.sa
doi:10.1016/j.procs.2015.05.067
mobile devices to make calls to premium numbers without the end-users’ permission, steal contact data, or participate in botnet activities.
Exchanging short text messages (SMS) among mobile phones is very convenient and frequently used for commu-
nication on a daily basis. Consequently, the number of unwanted SMS messages (spam) is growing. In 2012, there were 350,000 variants of SMS spam globally 2 . SMS spam has been considered a serious security threat since the early 2000s 3 . For example, hackers can send phishing messages to collect confidential information or launch other types of attacks. The risks of SMS spam include operational disruption and financial loss. It is becoming easier to target end users through SMS than through electronic mail (email), since the email service is more mature and more effective email spam filters have been developed and deployed by service providers and users. Unfortunately, this is not the case with SMS spam. The controls used by mobile phones to block SMS spam are not as effective as email anti-spam filters. Filtering is also a challenging task since SMS messages have limited size, which means less statistically distinguishing information.
Recently, several methods have been investigated to detect SMS spam, including content-based approaches 3–7 . How-
ever, the accuracy is still relatively low and further research is required to investigate new features and new ways of
calculating and utilizing them.
In this paper, we analyze several feature sets and study their impact on two machine learning algorithms. Then,
we combine the top two relevant feature sets and build a more effective model. Inspired by the danger theory and the
immune-based systems, we propose a novel approach based on the Dendritic Cell Algorithm (DCA) for fusing the
results of Naïve Bayes (NB) and Support Vector Machines (SVM). DCA is a relatively recent approach in machine
learning 8 . Using two SMS datasets, we evaluate and compare the effectiveness of the individual feature sets and the
proposed fused model.
The remainder of this paper is organized as follows. Section 2 describes the methodology and Section 3 presents
the empirical analysis and results. Finally, Section 4 concludes the paper.
2. Methodology
Fighting textual SMS spam is typically treated as a document categorization problem, where individual messages are preprocessed and represented by feature vectors. Then, statistical or machine learning models are built using a training corpus to determine whether each received message is spam or legitimate (ham). Differences among various approaches lie mainly in how messages are transformed into feature vectors and how classification takes place. The details of the main phases of the proposed model are provided in the
following subsections.
2.1. Corpus Analysis and Representation
2.1.1. Enrichment
To enrich the SMS messages, we added two types of semantic tagging: part-of-speech (POS) tags and recognized named-entity tags. The POS tags are the linguistic categories of words. We assign the POS tags using the Penn Treebank
tag set (http://www.cis.upenn.edu/~treebank/). Examples of the possible tags are nouns, verbs, adjectives
and adverbs. We only extracted the part-of-speech tags for the first and last terms in each message as features since
they describe embedded grammatical structure that is unlikely to vary for each spammer or author 9 . The other type
of tags corresponds to recognized named entities using the OpenNLP model (https://opennlp.apache.org/).
These entities include location, organization, money, date, person and time 10 .
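For concreteness, the following is a minimal sketch (not the authors' code) of how the POS tags of the first and last terms of a message could be obtained. It uses NLTK's Penn Treebank tagger as a stand-in for the tagging tools described above; the named-entity tagging with OpenNLP is omitted.

import nltk

# One-time downloads of the tokenizer and tagger models (resource names may
# differ across NLTK versions).
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

def first_last_pos(message):
    """Return the Penn Treebank POS tags of the first and last tokens of an SMS."""
    tokens = nltk.word_tokenize(message)
    if not tokens:
        return None, None
    tagged = nltk.pos_tag(tokens)          # list of (token, tag) pairs
    return tagged[0][1], tagged[-1][1]

print(first_last_pos("Congratulations! You have won a free prize, call now"))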
2.1.2. Preprocessing
The preprocessing phase includes the following steps. First, the SMS message is converted into lowercase char-
acters before being passed to the next stage. Second, each SMS message is treated as a string and then divided into
distinct tokens (words). Third, each word is reduced to its root by removing all suffixes and prefixes such as ‘tion’,
‘ing’ and ‘er’. We used the Porter stemming algorithm to achieve this task 11 .
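A minimal sketch of these three preprocessing steps, assuming NLTK's PorterStemmer and a simple regular-expression tokenizer (the authors do not specify their tokenizer), is shown below.

import re
from nltk.stem import PorterStemmer

_stemmer = PorterStemmer()

def preprocess(message):
    lowered = message.lower()                    # step 1: convert to lowercase
    tokens = re.findall(r"[a-z0-9']+", lowered)  # step 2: split into tokens
    return [_stemmer.stem(t) for t in tokens]    # step 3: Porter stemming, e.g. 'winning' -> 'win'

print(preprocess("WINNING prizes now! Claim your free membership"))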
2.1.3. Feature Extraction
Feature extraction is a crucial task for SMS classification. It should not require complex analysis, in order not to significantly delay the messaging service, but the extracted features should also be highly correlated with the message category to enhance the spam detection accuracy. As a result, each message is represented by a vector denoted as X = (x_1, x_2, x_3, \ldots, x_m), where m is the number of features and x_i, for i = 1, \ldots, m, represents the weight of the i-th feature for that message. In our work, we extracted and evaluated the following feature sets for SMS spam detection:
• URL Link: We normalize all URL links within SMS messages by replacing them with a single word (e.g.
httpLink). We consider the number of URLs in the SMS message as a feature since malicious spam SMS likely
asks the user to click on a link to visit a website for a prize or to download an application.
• Spam Words: Certain words and phrases are commonly used by spammers 6 ; see Table 1 for examples. We used the number of spam words that exist in an SMS as a feature. Our list consists of 350 terms collected from various sources and blogs that are publicly available on the web.
• Emotion Symbols: The existence of emotion symbols and icons may be a good indicator for legitimate SMS
messages. Examples of these symbols are happy, angry or sad faces. We used regular expressions to extract
these symbols.
• Special Characters: Spammers might use special characters for various reasons, such as bypassing simple keyword-based filters. For example, the dollar signs “$$$” can be used instead of the word ‘money’ in prize or finance related messages. We used regular expressions to extract these features.
• Message Metadata: This feature set includes message length, which is the overall byte length of SMS, number
of tokens and average token length.
• Function Words or Grammatical Words: These are non-content words that have little lexical meaning or have
ambiguous meaning, but exist to explain structural or grammatical relationships with other words within a
sentence or specify the mood or attitude of the author. Function words form a closed class of words that
is fixed and has a relatively small size. For example, Koppel and Ordan 12 used 300 function words for the
English language from LIWC 13 . Function words are lexically unproductive and are generally invariable in
form. Examples of function words are prepositions, pronouns, determiners, conjunctions, auxiliary verbs, and
particles; see Table 2. We evaluated function word features because they are very unlikely to be subject to conscious control by an author, owing to their high frequency of use and highly grammatical role 14 . We
relied on the word list available in 15 .
In addition to these feature sets, we included two other feature sets calculated during the enrichment phase which
are POS tags of the first and last terms in each SMS and the named entity tags (referred to as All tags).
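The sketch below illustrates how such a feature vector could be assembled. The word lists and regular expressions are our own small examples, not the 350-term spam-word list or the function-word list used in the paper, and the POS and named-entity tag features are omitted.

import re

SPAM_WORDS = {"free", "win", "prize", "claim", "cash", "guaranteed"}   # tiny sample, not the 350-term list
FUNCTION_WORDS = {"of", "at", "in", "on", "for", "the", "and", "or", "you", "he", "she"}
URL_RE = re.compile(r"https?://\S+|www\.\S+", re.IGNORECASE)
EMOTICON_RE = re.compile(r"[:;=8][-^o*']?[)(\]\[dDpP/\\|]")            # e.g. :-) ;( :D
SPECIAL_RE = re.compile(r"[$£€#*!]{2,}")                               # runs like "$$$" or "!!!"

def extract_features(message):
    tokens = message.split()
    lowered = [t.lower().strip(".,!?") for t in tokens]
    n_tokens = len(tokens)
    return [
        len(URL_RE.findall(message)),                                  # URL link count
        sum(1 for t in lowered if t in SPAM_WORDS),                    # spam-word count
        len(EMOTICON_RE.findall(message)),                             # emotion symbols
        len(SPECIAL_RE.findall(message)),                              # special characters
        len(message.encode("utf-8")),                                  # message length in bytes
        n_tokens,                                                      # number of tokens
        (sum(len(t) for t in tokens) / n_tokens) if n_tokens else 0.0, # average token length
        sum(1 for t in lowered if t in FUNCTION_WORDS),                # function-word count
    ]

print(extract_features("WIN a FREE prize!!! Claim now at http://example.com :-)"))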
Table 1. Examples of common spam words and phrases.
credit, loan, bills, info, money, investment, discount, win, order now, sign up, clearance, earn, free gift, free samples
dating, find, guess, statement, private, dear, partner, singles, fast cash, incredible deal, free info, satisfaction, buy direct
call free, call now, camcorder, phone, cards, extra inches, cialis, viagra, spa, beauty, money back, click here, act now
prize, guaranteed, claim, cash, no fees, limited time, life insurance, mortgage, amazing, 100% satisfied, 100% free
Table 2. Examples of function words.
Class Size Examples
Prepositions 124 of, at, in, on, for, without, between, besides, close to, down
Pronouns 70 he, she, you, him, her, our, anybody, it, one
Determiners 28 a, the, all, both, either, neither, some, those, every
Conjunctions 44 and, after, hence, however, that, when, while, although, or, yet
Auxiliary and modal verbs 17 may, had better, used to, might, shall, be able to, can, must
Quantifiers >86 no, none, one, two, much, many, the whole, part, various
Algorithm 1: Generation of DCA Signals
Data: SVM_c and NB_c decisions; SVM_cf and NB_cf confidences
Result: signals: PAMP, Safe, Danger
begin
    PAMP = 0, Safe = 0, Danger = 0;
    if SVM_c == NB_c then
        if SVM_c == "Spam" then
            PAMP = Max(SVM_cf, NB_cf);
        else
            Safe = Max(SVM_cf, NB_cf);
        end
    else
        Danger = Avg(SVM_cf, NB_cf);
    end
end

Algorithm 2: DCA Learning Algorithm
Data: Antigens and Signals (PAMP, Safe, Danger)
Result: Antigens and their MCAV values
begin
    initialize DC;
    while there is input do
        if Antigen then
            Expose DC to Antigen;
        else if Signals then
            calculate K and CSM;
            update DC;
        end
        if DC lifespan < 0 then
            reset DC;
        end
    end
    for each Antigen type do
        calculate MCAV;
    end
end
2.2. DCA-Based Classification
The Dendritic Cell Algorithm (DCA) is a recent immune-inspired classification algorithm developed based on the
behavior and function of Dendritic Cells (DCs) in the biological immune system 8,16 . The algorithm was successfully
applied to solve a number of classification problems in various domains, e.g. 8,16,17 . It starts with a collection of DCs, each of which is exposed to antigens (objects) and environmental signals. Below, we describe a novel approach for generating signals from the feature vectors. Then, we show how the DCA utilizes these signals to detect
SMS spam messages.
2.2.1. Generation of DCA Signals
In the DCA, there are three types of signals: PAMP, Danger, and Safe. The PAMP signal is a measure of confidence that the antigen represents spam. The Danger signal indicates a potential abnormality; its value increases as the confidence that the monitored system is in an abnormal state increases. Finally, the Safe signal increases in value in conjunction with legitimate messages; it represents a confidence indicator of normal, predictable or steady-state system behavior. To generate these signals, we combined the outputs of two different machine learning algorithms: Naïve Bayes (NB) and Support Vector Machine (SVM). The pseudo-code of this signal-generation process is outlined in Algorithm 1. For a particular message, each classifier takes the feature vector representing the message as input and produces a decision with a confidence level. Since the PAMP signal indicates a high level of assurance of an anomalous situation, it is set to the higher confidence level of the two classifiers when both agree that the antigen is spam. The Danger signal may or may not indicate an anomalous situation, but the probability of an anomaly is higher than under normal circumstances; hence, we used the average confidence level of the two classifiers when they disagree on the antigen classification. Finally, the presence of the Safe signal indicates that no anomalies are present; if the two classifiers agree that the antigen is non-spam, the higher confidence level of the two classifiers is used as the Safe signal. The derived signals and associated antigens are then passed to the DCA as input.
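A minimal sketch of this signal-generation rule (corresponding to Algorithm 1) is given below; it assumes each base classifier returns a (label, confidence) pair, e.g. the class-membership probability of NB or a calibrated SVM score. It is an illustration, not the authors' implementation.

def generate_signals(svm_out, nb_out):
    """svm_out, nb_out: (label, confidence) pairs with label in {"Spam", "Ham"}."""
    (svm_label, svm_conf), (nb_label, nb_conf) = svm_out, nb_out
    pamp = safe = danger = 0.0
    if svm_label == nb_label:
        if svm_label == "Spam":
            pamp = max(svm_conf, nb_conf)     # both agree on spam -> PAMP
        else:
            safe = max(svm_conf, nb_conf)     # both agree on ham  -> Safe
    else:
        danger = (svm_conf + nb_conf) / 2.0   # disagreement       -> Danger
    return pamp, safe, danger

print(generate_signals(("Spam", 0.92), ("Spam", 0.85)))   # (0.92, 0.0, 0.0)
print(generate_signals(("Spam", 0.60), ("Ham", 0.70)))    # (0.0, 0.0, 0.65)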
2.2.2. Dendritic Cell Algorithm
A high-level view of the main steps of the DCA is shown in Algorithm 2. The algorithm starts with a population of dendritic cells (DCs) 8 . Each DC has a different lifespan, which is initialized to a random value and then changes over time based on the exposure to antigens and signals. The combination of signal and antigen temporal correlation and the diversity of the DC population is responsible for the detection capability of the DCA. The maximum number of antigens that can be collected by a single DC is determined by the concentration of co-stimulatory molecules (CSM), which is initially assigned randomly to each DC. When a threshold value of the CSM is reached, the DC migrates and transforms to a mature or semi-mature state. The transformation is based on the overall abnormality of the signals seen by a dendritic cell, denoted as K. At a particular exposure n, the impact of the three types of signals on CSM and K is calculated using the following formulas:

\Delta CSM = PAMP_n \times w_{cp} + Danger_n \times w_{cd} + Safe_n \times w_{cs} \quad (1)

\Delta K = PAMP_n \times w_{kp} + Danger_n \times w_{kd} + Safe_n \times w_{ks} \quad (2)

where PAMP_n, Danger_n, and Safe_n are the input signals, w_{cp}, w_{cd}, and w_{cs} are the weights associated with CSM, and w_{kp}, w_{kd}, and w_{ks} are the weights associated with K. DCs are classified as mature or semi-mature based on the accumulated values of CSM and K, as shown in Figure 1.
Fig. 1. Classification of DC as mature or semi-mature
The final decision to classify an antigen as Spam or Legitimate is made based on the number of DCs that are fully mature. This is done by computing the mature context antigen value (MCAV), which gives the probability of a pattern being anomalous: the closer this value is to 1, the greater the probability that the antigen is anomalous. To overcome the problem of antigen deficiency and to ensure that each antigen appears in several contexts, each antigen is sampled multiple times using the antigen multiplier parameter of the DCA 17 . The DCA calculates the MCAV value for each antigen type using the following formula:
MCAV = \frac{M_i}{Ag} \quad (3)

where i refers to the antigen type (spam), M_i refers to the number of times that antigen appears in the mature context, and Ag is the total number of antigens. The MCAV value is then used to classify the SMS by comparing it to an anomaly threshold that is calculated from:

a_t = \frac{a_n}{t_n} \quad (4)

where a_t is the derived anomaly threshold, a_n is the number of anomalous data items and t_n is the total number of data items. The classification rule applied on the i-th message is as follows:

f(x) = \begin{cases} \text{Spam}, & \text{if } MCAV > a_t \\ \text{Legitimate}, & \text{otherwise} \end{cases}
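The following is a simplified, single-DC sketch of this classification stage (Eqs. 1-4), using the signal weights later reported in Table 7. The random DC population, lifespans and the antigen multiplier of the full algorithm are omitted, and the migration and anomaly thresholds here are hypothetical parameters; it illustrates the mechanism rather than reproducing the authors' implementation.

def run_dca(samples, migration_threshold=10.0, anomaly_threshold=0.5):
    """samples: iterable of (antigen_id, (pamp, safe, danger)) tuples."""
    csm = k = 0.0
    collected = []
    mature_count, seen_count = {}, {}

    for antigen, (pamp, safe, danger) in samples:
        collected.append(antigen)
        csm += 2 * pamp + safe + danger              # Eq. (1) with the weights of Table 7
        k   += 2 * pamp - 3 * safe + danger          # Eq. (2) with the weights of Table 7
        if csm >= migration_threshold:               # DC migrates once enough signal is accumulated
            is_mature = k > 0                        # mature vs. semi-mature context (cf. Fig. 1)
            for a in collected:
                seen_count[a] = seen_count.get(a, 0) + 1
                if is_mature:
                    mature_count[a] = mature_count.get(a, 0) + 1
            csm = k = 0.0
            collected = []

    decisions = {}
    for a, seen in seen_count.items():
        mcav = mature_count.get(a, 0) / seen         # fraction of mature presentations (cf. Eq. 3)
        decisions[a] = "Spam" if mcav > anomaly_threshold else "Legitimate"   # threshold a_t (cf. Eq. 4)
    return decisions

# Example: message ids paired with their (PAMP, Safe, Danger) signals
print(run_dca([("msg1", (0.9, 0.0, 0.0)), ("msg1", (0.8, 0.0, 0.1)),
               ("msg2", (0.0, 0.95, 0.0)), ("msg2", (0.0, 0.9, 0.0))],
              migration_threshold=1.5))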
3. Experimental Work
3.1. SMS Datasets
We used two datasets to evaluate and compare the effectiveness of the proposed short message detection model.
These datasets are publicly available and have been widely used in other published work. Table 3 shows a brief summary of these datasets; detailed descriptions are presented next.
Table 3. Benchmark spam filtering datasets (total number of SMS instances, number of spam instances, number of legitimate instances, average number of tokens per message (TPM)).
Dataset# Description # SMS instances # Spam instances # Legitimate instances # TPM
Dataset1 SMS Spam Corpus V.0.1 Big 1,324 322 1,002 15.72
Dataset2 SMS Spam Collection V.1 5,574 747 4,827 14.56
3.1.1. Dataset#1: SMS Spam Corpus V.0.1 Big
This corpus is a collection of 1,002 legitimate messages and 322 spam SMSs in English. The legitimate SMS messages were randomly selected from the National University of Singapore (NUS) SMS corpus (10,000 legitimate SMSs) and the Jon Stevenson corpus (202 legitimate SMSs). The spam messages were collected manually from the Grumbletext website, which is a public UK forum where users report SMS spam messages. The average
word length is 4.44 characters and the average number of words per message is 15.72 5 . This dataset is available at
(http://www.esp.uem.es/jmgomez/smsspamcorpus/) and has been used in 5,7,18 .
3.1.2. Dataset#2: SMS Spam Collection V.1
This corpus is a collection of spam and legitimate messages publicly available in raw format at (http://www.
dt.fee.unicamp.br/~tiago/smsspamcollection/) and is also hosted at the UCI machine learning repository.
There are a total of 5,574 SMS messages in English gathered from four free or free-for-research sources: the Grumbletext website (425 SMS), Caroline Tag’s PhD thesis (450 SMS), the National University of Singapore corpus (3,375 SMS) and the Jon Stevenson corpus (1,324 SMS). The corpus has a total of 4,827 legitimate messages and 747 spam messages. This
corpus is described and analyzed in 4 and has been recently used in 19 .
3.2. Evaluation Measures
The effectiveness is evaluated in terms of the percentage detection accuracy which is calculated from:
ACC = \frac{TP + TN}{TP + FP + FN + TN} \times 100 \quad (5)

where ACC is the accuracy, TP is the number of true positives, TN is the number of true negatives, FP is the number of false positives and FN is the number of false negatives. We also computed the percentage recall (REC), precision (PRE) and F-measure (F) for each category. Moreover, we computed the area under the ROC curve (AUC).
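For reference, a small helper of our own (not part of the paper) that computes these measures from the confusion-matrix counts, treating spam as the positive class, could look as follows.

def evaluate(tp, tn, fp, fn):
    acc = 100.0 * (tp + tn) / (tp + fp + fn + tn)          # Eq. (5), percentage accuracy
    pre = tp / (tp + fp) if (tp + fp) else 0.0             # precision for the spam class
    rec = tp / (tp + fn) if (tp + fn) else 0.0             # recall for the spam class
    f1  = 2 * pre * rec / (pre + rec) if (pre + rec) else 0.0
    return {"ACC": acc, "PRE": pre, "REC": rec, "F": f1}

print(evaluate(tp=300, tn=980, fp=22, fn=22))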
3.3. Experiments and Discussions
We first performed a series of experiments to evaluate the individual feature sets extracted from both datasets. Two
types of machine learning algorithms are used: Support Vector Machine (SVM) and Naïve Bayes (NB). The results are shown in Tables 4 and 5 for SVM and NB, respectively. The performance is recorded for 10-fold cross validation in terms of the precision (PRE), recall (REC) and F-measure (F) for each message category. The tables also show the percentage overall accuracy (ACC) and the area under the ROC curve (AUC) for each case. Analyzing these results, we found that there are two dominating feature sets with very high AUCs: ‘Spam Words (SW)’ and ‘Metadata (MD)’. They are more relevant to the classification process and their combination may yield better results. We then merged the two feature sets and rebuilt the classifiers, and found that this combination improved the effectiveness of both classifiers on both datasets. From a computational perspective, it is preferable to combine only these two feature sets rather than all the feature sets.
To demonstrate the effectiveness of the proposed DCA-based algorithm for SMS spam detection, we carried out the experiments again on both datasets. To adjust the DCA parameters, we ran several experiments with different values for the number of DCs, the antigen multiplier, and the signal weights. Due to space limitations, we only provide the best performance attained in Table 6; the corresponding parameters are listed in Table 7. For the sake of comparison, we also show the best results obtained by SVM and NB in Table 6. It can be observed that significant improvement is achieved by applying the proposed approach, even though only the two most relevant feature sets are used.
Table 4. SVM classification results
Dataset  Feature Set  Spam (PRE, REC, F)  Legitimate (PRE, REC, F)  AUC  ACC
URL 0.933 0.138 0.235 0.788 0.998 0.881 0.567 78.85
Spam words (SW) 0.985 0.810 0.887 0.945 0.996 0.970 0.983 95.16
Emotion symbols 0.000 0.000 0.000 0.763 1.000 0.865 0.500 75.68
Special characters 0.000 0.000 0.000 0.762 0.998 0.864 0.606 75.53
Dataset1 All tags 0.689 0.503 0.576 0.857 0.926 0.890 0.717 82.33
First and last terms POS 0.000 0.000 0.000 0.763 1.000 0.865 0.500 75.68
Metadata (MD) 0.854 0.843 0.847 0.951 0.954 0.953 0.967 92.60
Function words 0.579 0.497 0.530 0.851 0.887 0.868 0.845 79.23
Combined(SW,MD) 0.978 0.871 0.921 0.962 0.994 0.977 0.993 96.45
URL 0.956 0.144 0.248 0.883 0.999 0.937 0.571 88.43
Spam words (SW) 0.922 0.757 0.831 0.964 0.990 0.977 0.959 95.89
Emotion symbols 0.000 0.000 0.000 0.866 1.000 0.928 0.500 86.60
Special characters 0.000 0.000 0.000 0.866 1.000 0.928 0.399 86.60
Dataset2 All tags 0.587 0.162 0.254 0.883 0.982 0.930 0.541 87.17
First and last terms POS 0.000 0.000 0.000 0.866 1.000 0.928 0.500 86.60
Metadata (MD) 0.712 0.456 0.554 0.920 0.972 0.945 0.887 90.22
Function words 0.000 0.000 0.000 0.866 1.000 0.928 0.487 86.60
Combined(SW,MD) 0.914 0.775 0.838 0.966 0.989 0.977 0.973 96.02
Table 5. Naïve Bayes classification results
Dataset  Feature Set  Spam (PRE, REC, F)  Legitimate (PRE, REC, F)  AUC  ACC
URL 0.961 0.140 0.240 0.789 0.998 0.881 0.567 78.85
Spam words (SW) 0.935 0.923 0.928 0.976 0.979 0.978 0.983 96.60
Emotion symbols 0.240 1.000 0.387 0.600 0.013 0.025 0.500 25.30
Special characters 0.525 0.221 0.305 0.795 0.937 0.860 0.753 76.36
Dataset1 All tags 0.553 0.610 0.556 0.847 0.788 0.787 0.731 74.31
First and last terms POS 0.615 0.419 0.497 0.836 0.920 0.876 0.801 79.83
Metadata (MD) 0.653 0.894 0.752 0.963 0.847 0.901 0.948 85.88
Function words 0.571 0.545 0.556 0.860 0.870 0.865 0.848 79.15
Combined(SW,MD) 0.855 0.949 0.899 0.984 0.949 0.966 0.983 94.79
URL 0.948 0.143 0.248 0.883 0.999 0.937 0.500 88.43
Spam words (SW) 0.737 0.863 0.794 0.978 0.952 0.965 0.960 94.03
Emotion symbols 0.141 1.000 0.247 1.000 0.059 0.111 0.529 18.50
Special characters 0.080 0.012 0.021 0.866 0.985 0.922 0.731 85.47
Dataset2 All tags 0.446 0.498 0.470 0.921 0.904 0.912 0.712 84.95
First and last terms POS 0.692 0.169 0.270 0.885 0.988 0.934 0.767 87.82
Metadata (MD) 0.548 0.809 0.653 0.968 0.896 0.931 0.925 88.45
Function words 0.000 0.000 0.000 0.866 1.000 0.928 0.822 88.60
Combined(SW,MD) 0.835 0.863 0.848 0.979 0.973 0.976 0.967 95.86
Table 6. Comparison of DCA with best performance of SVM and NB
Dataset  Approach  Spam (PRE, REC, F)  Legitimate (PRE, REC, F)  AUC  ACC
Proposed 1.000 0.991 0.995 0.997 1.000 0.999 0.999 99.77
Dataset1 SVM 0.978 0.871 0.921 0.962 0.994 0.977 0.993 96.45
NB 0.855 0.949 0.899 0.984 0.949 0.966 0.983 94.79
Proposed 1.000 0.996 0.998 0.999 1.000 1.000 0.999 99.95
Dataset2 SVM 0.914 0.775 0.838 0.966 0.989 0.977 0.973 96.02
NB 0.835 0.863 0.848 0.979 0.973 0.976 0.967 95.86
Table 7. DCA best parameters used.
Parameters          Values for Dataset1   Values for Dataset2
Number of DCs       40                    30
Antigen multiplier  80                    50
Signal weights      ΔCSM = 2 × PAMP + Safe + Danger;  ΔK = 2 × PAMP − 3 × Safe + Danger  (same for both datasets)

4. Conclusions

With the evolution of mobile technology and the increased dependence on smart devices, the number of spam SMS messages is growing at an unprecedented rate. Spam is not only annoying but can also be a vehicle for more severe security threats and information leakage. To control this problem, we analyzed and evaluated several feature
sets, which can be easily extracted from the received messages, using two machine learning algorithms. We also explored the impact of combining the two most relevant feature sets on the performance of the machine learning algorithms. Subsequently, we developed a novel approach based on the DCA that fuses the outputs of the two classifiers. The empirical results showed that significant improvement can be achieved by applying the proposed approach (with close to 100% accuracy). As future work, we plan to compare it with other models and test it on additional datasets.
Acknowledgment
The second author would also like to acknowledge the support provided by King Abdulaziz City for Science and
Technology (KACST) through the Science & Technology Unit at King Fahd University of Petroleum & Minerals
(KFUPM) for funding this work under project No. 11-INF1658-04 as part of the National Science, Technology and
Innovation Plan.
References
1. Sanou, B. The world in 2014: ICT facts and figures. 2014. https://www.itu.int/en/ITU-D/Statistics/Documents/facts/
ICTFactsFigures2014-e.pdf.
2. Baldwin, C. 350,000 different types of spam SMS messages were targeted at mobile users in 2012. 2013. http://www.computerweekly.
com/news/2240178681/.
3. Sohn, D.N., Lee, J.T., Han, K.S., Rim, H.C. Content-based mobile spam classification using stylistically motivated features. Pattern
Recognition Letters 2012;33(3):364–369.
4. Almeida, T.A., Hidalgo, J.M.G., Yamakami, A. Contributions to the study of SMS spam filtering: new collection and results. In:
Proceedings of the 11th ACM Symposium on Document Engineering. 2011, p. 259–262.
5. Cormack, G.V., Gómez Hidalgo, J.M., Sánz, E.P. Spam filtering for short messages. In: Proceedings of the 16th ACM Conference on
Information and Knowledge Management. 2007, p. 313–320.
6. Delany, S.J., Buckley, M., Greene, D. SMS spam filtering: methods and data. Expert Systems with Applications 2012;39(10):9899–9908.
7. Gómez Hidalgo, J.M., Bringas, G.C., Sánz, E.P., García, F.C. Content based SMS spam filtering. In: Proceedings of the ACM Symposium
on Document Engineering. 2006, p. 107–114.
8. Greensmith, J., Aickelin, U., Cayzer, S. Detecting danger: The dendritic cell algorithm. In: Robust Intelligent Systems. 2008, p. 89–112.
9. Wright, W.R., Chin, D.N. Personality profiling from text: Introducing part-of-speech n-grams. In: User Modeling, Adaptation, and
Personalization. Springer; 2014, p. 243–253.
10. Kim, M.H., Compton, P. Improving the performance of a named entity recognition system with knowledge acquisition. In: Knowledge
Engineering and Knowledge Management. Springer; 2012, p. 97–113.
11. Hull, D.A. Stemming algorithms: A case study for detailed evaluation. Journal of the American Society for Information Science - Special
Issue: Evaluation of Information Retrieval Systems 1996;47(1):70–84.
12. Koppel, M., Ordan, N. Translationese and its dialects. In: Proceedings of the 49th Annual Meeting of the Association for Computational
Linguistics: Human Language Technologies-Volume 1. Association for Computational Linguistics; 2011, p. 1318–1326.
13. Pennebaker, J.W., Francis, M.E., Booth, R.J. Linguistic Inquiry and Word Count: LIWC 2001. Mahwah, NJ: Lawrence Erlbaum Associates; 2001.
14. Argamon, S., Levitan, S. Measuring the usefulness of function words for authorship attribution. In: Proc. ACH/ALLC Conference. 2005.
15. Gilner, L., Morales, F. Function words. 2014. http://www.sequencepublishing.com.
16. Greensmith, J., Aickelin, U. The deterministic dendritic cell algorithm. In: Artificial Immune Systems. Springer; 2008, p. 291–302.
17. Huang, R., Tawfik, H., Nagar, A. Artificial dendritic cells algorithm for online break-in fraud detection. In: Proceedings of the 2nd IEEE
International Conference on Developments in eSystems Engineering (DESE). 2009, p. 181–189.
18. Cormack, G.V., Hidalgo, J.M.G., Sánz, E.P. Feature engineering for mobile (SMS) spam filtering. In: Proceedings of the 30th Annual
International ACM SIGIR Conference on Research and Development in Information Retrieval; 2007, p. 871–872.
19. Almeida, T., Hidalgo, J.M.G., Silva, T.P. Towards SMS spam filtering: Results under a new dataset. International Journal of Information
Security Science 2013;2(1):1–18.