Spa Ming
IN
CS SRIVATSAV 21BH5A6702
T.VENKATESH 20BH1A6749
B.GIRI 20BH1A6709
Mr. V.NARESH
Asst. Professor, Dept. of CSE
V. NARESH, Asst. Professor, Dept. of CSE
Dr. B. SRISHAILAM, M.Tech., Ph.D., Asst. Professor, Dept. of CSE (DS)
EXTERNAL EXAM
ACKNOWLEDGEMENT
The satisfaction that accompanies the successful completion of this project would be
incomplete without the mention of the people who made it possible. We consider it a
privilege to express our gratitude and respect to all those who guided us in the completion of
the project.
We are thankful to our internal guide, Mr. V. NARESH, Asst. Professor, Department
of Computer Science Engineering, St. Mary’s Engineering College, for having been a
source of encouragement and for instilling the vigor to do this project work.
We take this opportunity to express a deep sense of gratitude to Dr. T.N. SRINIVAS
RAO, Principal of St. Mary’s Engineering College, for allowing us to do this project and
for his affectionate encouragement in presenting this project work.
We convey our sincere thanks to Sri Rev. K.V.K RAO, Chairman of St. Mary’s
Engineering College, for giving us a learning environment in which to grow personally as
well as professionally.
We would like to express our thanks to all staff members who have helped us directly and
indirectly in accomplishing this project work. We also extend our sincere thanks to our
parents and friends for their moral support throughout the project work. Above all, we thank
God Almighty for his manifold mercies in carrying out this project work successfully.
K.ARUNACHALAM 21BH5A6708
CS.SRIVATSAV 21BH5A6702
T.VENKATESH 20BH1A6749
B.GIRI 20BH1A6709
DECLARATION
This is to certify that the work reported in this thesis titled “SMS SPAMING DETECTION
USING NLP TECHNIQUES”, submitted to the Department of CSE (Data Science), St.
Mary’s Engineering College in fulfilment of the requirements for the award of the degree of
Bachelor of Technology, is a bonafide work done by us in the academic year 2023-24 under
the guidance of V. NARESH. No part of the thesis is copied from books, journals or the
internet, and wherever a portion is taken, the same has been duly referred to in the text. The
reported results are based on the project work done entirely by us and are not copied from
any other source. We also declare that the matter embodied in this thesis has not previously
been submitted by us, in full or in part, for the award of any degree of any other institution
or university.
Date
K.ARUNACHALAM 21BH5A6708
CS.SRIVATSAV 21BH5A6702
T.VENKATESH 20BH1A6749
B.GIRI 20BH1A6709
ABSTRACT
In today’s digital world, mobile SMS (Short Message Service) communication has become a
part of almost every person's life. At the same time, mobile users suffer from the harassment
of spam SMS, which constitutes a veritable nuisance to mobile subscribers. SMS support
makes mobile devices more vulnerable, because an attacker can try to intrude into the system
by sending unsolicited messages and may thereby gain remote access to the device. We
propose a novel approach that analyzes message content and extracts features using the
TF-IDF (term frequency-inverse document frequency) technique to efficiently distinguish
spam messages from ham messages using different machine learning classifiers. The
classifiers used in the proposed work are evaluated with metrics such as accuracy, precision
and recall. In the proposed approach, a voting classifier is used to increase the accuracy
rate.
INDEX
1. INTRODUCTION
2. L
LIST OF FIGURES
S.NO  FIGURE NO  FIGURE NAME
1     7.1.1      ANDROID ARCHITECTURE
2     7.2.1      SELECT ANDROID AND MAINTAIN THE INSTALLATION PATH OF THE ANDROID SDK
3     7.2.2      SELECT ANDROID SDK AND AVD MANAGER
4     7.2.3      SELECT ANDROID SDK AND AVD MANAGER
5     7.2.4      INSTALLING THE ADT PLUGIN FOR ECLIPSE
6     7.2.5      PACKAGES AVAILABLE FOR DOWNLOAD
7     7.2.6      CHOOSE PACKAGES TO INSTALL
8     7.2.7      INSTALLING ARCHIVES
9     7.3.1      SOURCE CODE
10    7.3.2      SELECT NEW VIRTUAL DEVICE
11    7.3.3      CREATE NEW ANDROID VIRTUAL DEVICE
12    7.3.4      TEST DEVICE
13    8.1.1      DATA FLOW DIAGRAM
14    9.2        CLASS DIAGRAM
15    9.3        SEQUENCE DIAGRAM
16    9.4        SEQUENCE DIAGRAM
17    9.5        ACTIVITY DIAGRAM
LIST OF PLATES
S.NO  PLATE NO  PLATE NAME
1     12.1      Home Page
2     12.2      User Registration Page
3     12.3      Admin Login
4     12.4      User Activation
5     12.5
6     12.6
7     12.7
8     12.8
9     12.9
10    12.10
CHAPTER-1
1. INTRODUCTION
In the digital world, mobile devices serve many purposes in daily life, such as business,
communication and data sharing. In the communication context, mobile devices can use
email, SMS (Short Message Service) and online chat apps to share information that may
be personal or professional. Of these, SMS services are the most used.
Many companies do business by sending SMS in bulk to targeted customers about their
services, offers and promotions. SMS messages are short because of the character
limit, and they are delivered between mobile devices through the operator's network.
A message can be typed by the user. Another type is the automated SMS, which a
program sends according to its purpose. For such services, many third-party services
or APIs can send SMS in bulk to users in one click. An SMS can be spam or ham. A spam
SMS is an unwanted or undesired text message with content related to prizes,
promotions, advertisements and complimentary services. Spammers aim to steal
confidential information such as usernames, passwords and financial data. Through
spam SMS, phishers can mount a phishing attack in which they send malicious links and
invite the user to visit them in order to steal sensitive information from the user's
mobile device. Spam messages may also carry spyware through which the spammer can
steal data or damage the system. The solution to these problems is the accurate and
timely identification of each SMS as spam or ham, so that users can manage incoming
messages and take action on spam messages. Using feature selection, each SMS is
classified as spam or ham. Spam detection here follows a content-based approach that
analyzes the text of the message.
CHAPTER-2
2. LITERATURE
3.2.1 ADVANTAGES:
We have collected the SMS Spam dataset, which is publicly available on the UCI
repository. It consists of 5572 text messages, labelled as 747 spam messages and 4825
ham messages. Once the dataset is gathered, we apply a sequence of steps to it. First,
we perform exploratory data analysis (EDA) on the dataset. Then we apply text
preprocessing to clean the message text, for example by removing special symbols and
converting the text to lower case. The next step is to convert the cleaned text into
numerical values before applying the classifiers; for this we use the TF-IDF technique
to extract the features. After feature extraction we apply different individual and
ensemble classifiers, such as Random Forest, Bernoulli Naïve Bayes, Support Vector
Machine, Bagging Classifier and Extra Random Tree, and then apply a voting classifier
that combines their votes to decide the final label for spam detection.
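The feature extraction and voting steps described above can be sketched in plain Python. This is a minimal illustration, not the implementation used in the project: the function names `tfidf` and `hard_vote`, the three toy messages and the unsmoothed weighting tf × ln(N/df) are assumptions made for the example (library implementations usually apply smoothing and normalization), and the voting step is assumed to be simple majority (hard) voting over the labels returned by the individual classifiers.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vec = {}
        for term, count in counts.items():
            tf = count / len(doc)              # relative term frequency
            idf = math.log(n_docs / df[term])  # rarer terms weigh more
            vec[term] = tf * idf
        vectors.append(vec)
    return vectors


def hard_vote(predictions):
    """Majority label among the labels predicted by several classifiers."""
    return Counter(predictions).most_common(1)[0][0]


# Toy, already-cleaned messages (illustrative only, not from the dataset).
docs = [
    ["win", "free", "prize", "now"],
    ["free", "entry", "win", "cash"],
    ["see", "you", "at", "lunch"],
]
vectors = tfidf(docs)

# Labels as three hypothetical classifiers might return them for one SMS.
final_label = hard_vote(["spam", "ham", "spam"])
```

In this sketch a term that occurs in only one message, such as "lunch", receives a higher weight than a term shared by several messages, such as "win", which is the property that makes TF-IDF useful for separating spam vocabulary from everyday vocabulary.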
CHAPTER-4
4. FEASIBILITY STUDY
The feasibility of the project is analyzed in this phase, and a business proposal is put
forth with a very general plan for the project and some cost estimates. During system
analysis, the feasibility study of the proposed system is carried out. This is to
ensure that the proposed system is not a burden to the company. For feasibility
analysis, some understanding of the major requirements for the system is essential.
THREE KEY CONSIDERATIONS INVOLVED IN THE
FEASIBILITY ANALYSIS ARE:
ECONOMICAL FEASIBILITY
TECHNICAL FEASIBILITY
SOCIAL FEASIBILITY
4.1 ECONOMICAL FEASIBILITY
This study is carried out to check the economic impact that the system will
have on the organization. The amount of funds that the company can pour into the
research and development of the system is limited, so the expenditures must be
justified. The developed system is well within the budget, and this was achieved
because most of the technologies used are freely available. Only the customized
products had to be purchased.
4.2 TECHNICAL FEASIBILITY
This study is carried out to check the technical feasibility, that is, the
technical requirements of the system. Any system developed must not place a high
demand on the available technical resources, since this would lead to high demands
being placed on the client. The developed system must have modest requirements, as
only minimal or no changes are required for implementing this system.
4.3 SOCIAL FEASIBILITY
This aspect of the study checks the level of acceptance of the system by the
user. This includes the process of training the user to use the system efficiently. The
user must not feel threatened by the system, but must instead accept it as a necessity.
The level of acceptance by the users depends solely on the methods that are employed to
educate the users about the system and to make them familiar with it. Their level of
confidence must be raised so that they are also able to offer constructive criticism,
which is welcomed, as they are the final users of the system.
CHAPTER-5
5. SYSTEM REQUIREMENTS
5.1 HARDWARE REQUIREMENTS:
System           : Intel Core i3 processor
Hard Disk        : 250 GB
RAM              : 4 GB
Monitor          : 15” LED
Input devices    : Keyboard, mouse
5.2 SOFTWARE REQUIREMENTS:
Operating system : Windows 10 (64-bit / 32-bit)
Coding Language  : Python
[Figure: BERT-based SMS spam detection pipeline. “SMS Domain-Specific Text Data”
(example: “Change to win the prize! Invest 1000 and get 100000”) flows to “BERT Text
Pre-processor” (tokenization, adding special tokens, padding and truncation, segment
embedding, masking), then as “Processed SMS Text” to “Pre-Trained BERT (base) Model”,
then to “Contextual Sentence Embedding for SMS Text”, then to “Classifier: ML Models
(SVM, RF, XGBoost), DL Models (LSTM, BiLSTM)”, which outputs “Spam or Ham”. A “Label
(Spam or Ham)” input also feeds the classifier stage.]
Our dataset contains the raw, unprocessed text of SMS messages, and to make it
compatible with the BERT model we must preprocess the text data. BERT offers its own
preprocessing package that transforms raw text into processed text that is suitable
for use with BERT. As depicted in the figure, the first step involves providing the raw
SMS text to the preprocessor, which then follows the steps outlined below.
1. Tokenization: The input text is broken down into individual words or sub-words,
called tokens. BERT uses WordPiece tokenization, which means that it can split words
into smaller sub-words as needed, allowing for more efficient use of the model's
vocabulary.
2. Adding special tokens: BERT requires special tokens to mark the beginning of the
input ([CLS]) and the end of each sentence ([SEP]), as well as to mark where the actual
text ends and any padding ([PAD]) begins. These special tokens are added to the
tokenized text.
3. Padding and truncation: BERT models require fixed-length inputs, so the text is
either padded with special tokens or truncated to a specific length.
4. Segment embedding: In order to enable the model to differentiate between
different sentences in a document, each token is assigned a segment ID indicating
which sentence it belongs to.
5. Masking: A random subset of the input tokens is masked during training, meaning
that they are replaced with a special [MASK] token. This encourages the model to learn
to predict missing words based on the surrounding context.
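The first four steps above can be illustrated with a small, self-contained sketch. It is not the BERT preprocessing package itself: the ten-entry vocabulary `VOCAB`, the helper names `wordpiece` and `encode`, and the maximum length of 10 are all hypothetical choices made for the example, whereas a real BERT vocabulary has roughly 30,000 entries. The sketch assumes a greedy longest-match-first sub-word split in the style of WordPiece, with "##" marking a continuation piece.

```python
# Hypothetical toy vocabulary; a real BERT vocabulary is far larger.
VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
         "win": 4, "the": 5, "prize": 6, "invest": 7, "##ing": 8, "##s": 9}


def wordpiece(word):
    """Greedy longest-match-first split of one word into sub-word tokens."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation marker
            if candidate in VOCAB:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no known piece: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


def encode(text, max_len=10):
    """Tokenize, add special tokens, then truncate or pad to max_len."""
    tokens = ["[CLS]"]
    for word in text.lower().split():
        tokens.extend(wordpiece(word))
    tokens = tokens[:max_len - 1] + ["[SEP]"]  # truncate, then close
    attention_mask = [1] * len(tokens)         # 1 marks real tokens
    padding = max_len - len(tokens)
    tokens += ["[PAD]"] * padding
    attention_mask += [0] * padding            # 0 marks padding
    input_ids = [VOCAB[t] for t in tokens]
    segment_ids = [0] * max_len                # single sentence: segment 0
    return tokens, input_ids, attention_mask, segment_ids


tokens, input_ids, attention_mask, segment_ids = encode("Investing wins the prize")
```

Here "investing" is split into "invest" and "##ing", which shows how sub-word tokenization lets a small vocabulary cover words it does not contain as whole entries. Since an SMS is treated as a single sentence, every token receives segment ID 0.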
Next, once the preprocessor has generated the data in a format that can be input to the
BERT model, we fine-tune the BERT model. Fine-tuning involves training the model
on the downstream task using the prepared data. The BERT model itself was pre-trained
with two techniques, namely the masked language model (MLM) and next sentence
prediction (NSP). With the MLM technique, a word in a sentence is hidden, and BERT is
required to predict the masked word by taking into account the surrounding words in
both directions. In contrast, NSP ensures that BERT learns the connection between two
sentences by asking it to predict whether the second sentence follows the first.
Notably, BERT is trained using both MLM and NSP simultaneously; in NSP, half of the
sentence pairs are true consecutive sentences and half are not. Having been trained
with MLM and NSP, BERT generates a contextual sentence embedding for the entire SMS
text that served as input.
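The masked language model objective mentioned above can be illustrated with a short sketch of how training examples might be built. The function name `mask_tokens`, the fixed seed and the example tokens are assumptions for illustration. The default masking rate of 15% follows the published BERT setup, but the sketch simplifies it by always substituting [MASK], whereas BERT sometimes keeps the original token or inserts a random one.

```python
import random


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (masked_tokens, labels).

    labels holds the original token at each masked position and None
    elsewhere, so a model can be scored only on the hidden words.
    """
    rng = random.Random(seed)               # seeded for repeatable output
    special = {"[CLS]", "[SEP]", "[PAD]"}   # never mask structural tokens
    masked, labels = [], []
    for token in tokens:
        if token not in special and rng.random() < mask_prob:
            masked.append("[MASK]")
            labels.append(token)
        else:
            masked.append(token)
            labels.append(None)
    return masked, labels


example = ["[CLS]", "win", "the", "prize", "[SEP]"]
masked_all, labels_all = mask_tokens(example, mask_prob=1.0)
masked_none, labels_none = mask_tokens(example, mask_prob=0.0)
```

Setting `mask_prob` to 1.0 or 0.0, as in the two calls above, makes the behaviour easy to inspect: special tokens are left untouched in both cases, and the labels record exactly which words the model would have to recover.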
6.2 DATA FLOW DIAGRAM:
1. The DFD is also called a bubble chart. It is a simple graphical formalism that can be
used to represent a system in terms of the input data to the system, the various
processing carried out on this data, and the output data generated by the system.
2. The data flow diagram (DFD) is one of the most important modeling tools. It is used to
model the system components. These components are the system processes, the data used
by the processes, the external entities that interact with the system and the
information flows in the system.
3. A DFD shows how information moves through the system and how it is modified by a
series of transformations. It is a graphical technique that depicts information flow and
the transformations that are applied as data moves from input to output.
4. A DFD may be used to represent a system at any level of abstraction. A DFD may be
partitioned into levels that represent increasing information flow and functional
detail.
6.3 UML DIAGRAMS
UML stands for Unified Modeling Language. UML is a standardized general-purpose
modeling language in the field of object-oriented software engineering. The
standard is managed, and was created, by the Object Management Group.
The goal is for UML to become a common language for creating models of
object-oriented computer software. In its current form, UML comprises two major
components: a meta-model and a notation. In the future, some form of method or
process may also be added to, or associated with, UML.
The Unified Modeling Language is a standard language for specifying,
visualizing, constructing and documenting the artifacts of a software system, as well
as for business modeling and other non-software systems.
The UML represents a collection of best engineering practices that have proven
successful in the modeling of large and complex systems.
The UML is a very important part of developing object-oriented software and of the
software development process. The UML uses mostly graphical notations to express
the design of software projects.
GOALS:
The primary goals in the design of the UML are as follows:
1. Provide users with a ready-to-use, expressive visual modeling language so that they
can develop and exchange meaningful models.
2. Provide extensibility and specialization mechanisms to extend the core concepts.
3. Be independent of particular programming languages and development processes.
4. Provide a formal basis for understanding the modeling language.
5. Encourage the growth of the OO tools market.
6. Support higher level development concepts such as collaborations, frameworks,
patterns and components.
7. Integrate best practices.
Component:
Deployment:
CHAPTER-7
7. IMPLEMENTATION
7.1 DATA COLLECTION:
The quality of the dataset is of great value when performing any experiment in data
mining.
First, choose a dataset that contains the content of each SMS together with its category.
The dataset used here contains the content of each SMS and its category, ham or spam.
7.2 PRE-PROCESSING:
Pre-processing of the data is performed to improve the training model’s learning process.
SMS messages contain stop words, punctuation, and mixed upper- and lower-case words
that can hinder the learning of the model. Pre-processing is applied after collecting
the dataset with an equal number of SMS. First, each SMS is tokenized, because an SMS
is a string of words that is difficult for the model to learn from directly. Each SMS
is split into words so that pre-processing can be applied. Next, stop words are
removed, as they carry no weight. Afterward, stemming is performed, because SMS words
are often written in varying or incomplete forms; stemming reduces the tokenized words
to a common root form. Furthermore, numeric values are removed, because digits are
considered to have no impact on identifying ham or spam SMS and are ignored. Finally,
punctuation is removed, so that the proposed model can be trained well.
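The pre-processing steps above can be sketched as a single function. This is an illustrative sketch under stated assumptions rather than the project's code: the stop-word list `STOP_WORDS` is deliberately tiny and hypothetical, `simple_stem` is a crude suffix stripper standing in for a real stemmer such as Porter's, and punctuation and digits are removed before tokenization here (a slightly different order from the description) so that words separated only by punctuation stay apart.

```python
import string

# Hypothetical, deliberately tiny stop-word list for illustration.
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "you", "your", "and", "of", "for"}


def simple_stem(word):
    """Crude suffix stripping; a stand-in for a real stemming algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three characters so short words are not destroyed.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word


def preprocess(sms):
    """Lower-case, strip punctuation and digits, tokenize, drop stop words, stem."""
    text = sms.lower()
    # Replace punctuation and digits with spaces so words stay separated.
    drop = string.punctuation + string.digits
    text = text.translate(str.maketrans(drop, " " * len(drop)))
    tokens = text.split()
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return [simple_stem(t) for t in tokens]


cleaned = preprocess("WINNER!! You are selected for 2 free tickets, claiming ends today.")
```

For the sample message, `cleaned` contains the tokens winner, select, free, ticket, claim, end and today. A production pipeline would normally use an established stop-word list and stemmer, since the suffix rules here are too coarse for general English text.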
7.3 ML ALGORITHM:
The attributes used for the investigation in this dataset are text and class. A
classifier is then applied to the dataset: the model is trained on the training data,
and testing is performed on the testing data to obtain the final results. In the last
step of the experiment, the confusion matrix is obtained and the results of the
applied classifier are examined and discussed.
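The confusion matrix and the measures derived from it can be computed as in the following sketch. The function names and the eight toy labels are assumptions made for illustration and are not results from the experiment; spam is treated as the positive class.

```python
def confusion_matrix(y_true, y_pred, positive="spam"):
    """Count true/false positives and negatives for a two-class problem."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1   # spam correctly flagged
        elif pred == positive:
            fp += 1   # ham wrongly flagged as spam
        elif truth == positive:
            fn += 1   # spam that slipped through
        else:
            tn += 1   # ham correctly passed
    return tp, fp, fn, tn


def metrics(tp, fp, fn, tn):
    """Return (accuracy, precision, recall) from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall


# Toy labels for illustration only.
y_true = ["spam", "ham", "ham", "spam", "ham", "spam", "ham", "ham"]
y_pred = ["spam", "ham", "spam", "spam", "ham", "ham", "ham", "ham"]

tp, fp, fn, tn = confusion_matrix(y_true, y_pred)
accuracy, precision, recall = metrics(tp, fp, fn, tn)
```

With these toy labels the counts are 2 true positives, 1 false positive, 1 false negative and 4 true negatives, giving an accuracy of 0.75 and a precision and recall of 2/3 each. Because ham messages greatly outnumber spam in typical SMS data, precision and recall are worth reporting alongside accuracy.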
KNN Classification
Decision Tree Algorithm
Naive Bayes
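Of the classifiers listed above, Naive Bayes is the simplest to show in full. The following sketch implements a multinomial Naive Bayes classifier with Laplace (add-one) smoothing over tokenized messages. The class name `NaiveBayes`, the four training messages and the choice to ignore words never seen in training are assumptions made for the example, not a description of the project's code.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes over word counts with add-one smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)      # messages per class
        self.word_counts = defaultdict(Counter)  # word counts per class
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(doc)
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, doc):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            # Work in log space to avoid underflow on long messages.
            score = math.log(n_docs / total_docs)  # log prior
            total_words = sum(self.word_counts[label].values())
            for word in doc:
                if word not in self.vocab:
                    continue  # assumption: skip words unseen in training
                count = self.word_counts[label][word] + 1  # add-one smoothing
                score += math.log(count / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data for illustration only.
train_docs = [
    ["win", "free", "prize"],
    ["free", "cash", "now"],
    ["see", "you", "at", "lunch"],
    ["call", "me", "at", "home"],
]
train_labels = ["spam", "spam", "ham", "ham"]

model = NaiveBayes().fit(train_docs, train_labels)
spam_guess = model.predict(["free", "prize", "now"])
ham_guess = model.predict(["see", "you", "at", "home"])
```

The smoothing step matters in practice: without it, a single word that never appeared in spam during training would drive the spam probability of a whole message to zero.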
CHAPTER-8
8. SYSTEM TEST
The purpose of testing is to discover errors. Testing is the process of trying to discover
every conceivable fault or weakness in a work product. It provides a way to check the
functionality of components, subassemblies, assemblies and/or a finished product. It is
the process of exercising software with the intent of ensuring that the software system
meets its requirements and user expectations and does not fail in an unacceptable
manner. There are various types of tests. Each test type addresses a specific testing
requirement.
8.1 TYPES OF TESTS:
8.1.1 UNIT TESTING
Unit testing involves the design of test cases that validate that the
internal program logic is functioning properly and that program inputs produce valid
outputs. All decision branches and internal code flow should be validated. It is the
testing of individual software units of the application. It is done after the
completion of an individual unit, before integration. This is structural testing that
relies on knowledge of the unit's construction and is invasive. Unit tests perform
basic tests at component level and test a specific business process, application,
and/or system configuration. Unit tests ensure that each unique path of a business
process performs accurately to the documented specifications and contains clearly
defined inputs and expected results.
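A unit test of this kind might look as follows, using Python's standard `unittest` module. The function `normalize_sms` is a hypothetical unit invented for the example and does not come from the project; each test method fixes one clearly defined input and its expected result.

```python
import unittest


def normalize_sms(text):
    """Hypothetical unit under test: lower-case and collapse whitespace."""
    return " ".join(text.lower().split())


class NormalizeSmsTest(unittest.TestCase):
    def test_lowercases_text(self):
        self.assertEqual(normalize_sms("FREE Prize"), "free prize")

    def test_collapses_whitespace(self):
        self.assertEqual(normalize_sms("  win   now \n"), "win now")

    def test_empty_input_gives_empty_output(self):
        self.assertEqual(normalize_sms(""), "")


# Run the test case programmatically and keep the result for inspection.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(NormalizeSmsTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Running the suite through `TextTestRunner` rather than `unittest.main()` keeps the example usable inside a larger script, since the outcome is available in `result` instead of ending the process.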
8.1.2 INTEGRATION TESTING
Integration tests are designed to test integrated software
components to determine whether they actually run as one program. Testing is event
driven and is more concerned with the basic outcome of screens or fields. Integration
tests demonstrate that, although the components were individually satisfactory, as
shown by successful unit testing, the combination of components is correct and
consistent. Integration testing is specifically aimed at exposing the problems that
arise from the combination of components.
8.1.3 FUNCTIONAL TEST
Functional tests provide systematic demonstrations that the functions tested are
available as specified by the business and technical requirements, system
documentation, and user manuals.
Functional testing is centered on the following items:
Valid Input        : identified classes of valid input must be accepted.
Invalid Input      : identified classes of invalid input must be rejected.
Functions          : identified functions must be exercised.
Output             : identified classes of application outputs must be exercised.
Systems/Procedures : interfacing systems or procedures must be invoked.
Organization and preparation of functional tests is focused on
requirements, key functions, or special test cases. In addition, systematic coverage
pertaining to identified business process flows, data fields, predefined processes, and
successive processes must be considered for testing. Before functional testing is
complete, additional tests are identified and the effective value of current tests is
determined.
8.1.4 SYSTEM TEST
System testing ensures that the entire integrated software system meets
requirements. It tests a configuration to ensure known and predictable results. An
example of system testing is the configuration oriented system integration test. System
testing is based on process descriptions and flows, emphasizing pre-driven process
links and integration points.
8.1.5 WHITE BOX TESTING
White box testing is testing in which the software tester has
knowledge of the inner workings, structure and language of the software, or at least its
purpose. It is used to test areas that cannot be reached from a black box
level.
8.1.6 BLACK BOX TESTING
Black box testing is testing the software without any knowledge of the inner
workings, structure or language of the module being tested. Black box tests, like most
other kinds of tests, must be written from a definitive source document, such as a
specification or requirements document. It is testing in which the software under test
is treated as a black box: you cannot “see” into it. The test provides inputs and
responds to outputs without considering how the software works.
8.1.7 ACCEPTANCE TESTING
User acceptance testing is a critical phase of any project and requires significant
participation by the end user. It also ensures that the system meets the functional
requirements.
Test Results: All the test cases mentioned above passed successfully. No defects were
encountered.
9.1.1 OBJECTIVES:
1. Input design is the process of converting a user-oriented description of the input into
a computer-based system. This design is important to avoid errors in the data input
process and to show management the correct direction for getting correct
information from the computerized system.
2. It is achieved by creating user-friendly screens for data entry that can handle large
volumes of data. The goal of designing input is to make data entry easier and free
from errors. The data entry screen is designed in such a way that all data
manipulations can be performed. It also provides record viewing facilities.
3. When data is entered, it is checked for validity. Data can be entered with the
help of screens. Appropriate messages are provided as and when needed so that the user
is never left in a maze. Thus the objective of input design is to create an input
layout that is easy to follow.
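The validity check and user messages described in the objectives above could take a form like the following sketch. The function name `validate_sms_input`, the message wording and the choice of 160 characters (the length of a single-part SMS in the default GSM 7-bit alphabet) as the limit are assumptions for illustration; the report does not specify the actual validation rules.

```python
MAX_SMS_LENGTH = 160  # assumption: single-part SMS, GSM 7-bit alphabet


def validate_sms_input(text):
    """Return (is_valid, message) for a message typed into the input screen."""
    if not isinstance(text, str):
        return False, "Message must be text."
    stripped = text.strip()
    if not stripped:
        return False, "Message is empty. Please enter the SMS text."
    if len(stripped) > MAX_SMS_LENGTH:
        return False, f"Message is longer than {MAX_SMS_LENGTH} characters."
    return True, "Message accepted."


ok, ok_message = validate_sms_input("Win a free prize now!")
empty_ok, empty_message = validate_sms_input("   ")
```

Returning a message together with the verdict lets the data entry screen tell the user what to correct, in line with the goal of keeping input easy to follow.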
Target:
View:
Graph:
Data:
RF:
CHAPTER-11
11. CONCLUSION
We test our classification model on our prepared dataset and
measure its SMS spam detection performance. To evaluate the
performance of the classification model and make it comparable to current
approaches, we use accuracy to measure the effectiveness of the classifiers. The
experiment was performed on various classifiers, namely decision tree, KNN and
Naïve Bayes, for SMS spam detection. The Naïve Bayes classifier showed the highest
accuracy among these classifiers. Future work should explore ways to enlarge the
feature set. Including more informative features, such as thresholds on message
length, and examining learning curves can contribute to improved results. In future,
an application using these techniques can be built for mobile phones to protect them
from spam messages.