
A

MAJOR PROJECT REPORT


ON
SMS SPAMING DETECTION USING NLP TECHNIQUES

Submitted in partial fulfilment of the requirements for award of the degree of


BACHELOR OF TECHNOLOGY

IN

CSE (DATA SCIENCE)


By
K.ARUNACHALAM 21BH5A6708

CS SRIVATSAV 21BH5A6702
T.VENKATESH 20BH1A6749
B.GIRI 20BH1A6709

Under the guidance of

Mr. V.NARESH
Asst. Professor, Dept. of CSE

DEPARTMENT OF CSE (Data Science)


St. Mary’s Engineering College
(Approved by AICTE, New Delhi, Affiliated to JNTU Hyderabad, Accredited by NAAC)
Deshmukhi (V), Pochampally (M), Yadadri Bhuvanagiri (D), Telangana 508284
St Mary’s Engineering College
(Affiliated to JNTU Hyderabad, Approved by AICTE, Accredited by NAAC) Near
Ramoji Film City, Deshmukhi(v), Yadadri bhongir Dist-508284

This is to certify that the project titled “SMS SPAMING DETECTION USING NLP TECHNIQUES”
has been carried out and submitted by K.ARUNACHALAM (21BH5A6708),
CS.SRIVATSAV (21BH5A6702), T.VENKATESH (20BH1A6749) and B.GIRI (20BH1A6709)
as their mini project, in partial fulfillment of the requirements for the award of
the degree of Bachelor of Technology in the department of DATA SCIENCE.

INTERNAL GUIDE
V.NARESH
Asst. Professor, Dept. of CSE

HEAD OF THE DEPARTMENT
Dr. B.SRISHAILAM, M.Tech., Ph.D.
Asst. Professor, Dept. of CSE (DS)

EXTERNAL EXAM
ACKNOWLEDGEMENT

The satisfaction that accompanies the successful completion of this project would be
incomplete without the mention of the people who made it possible. We consider it as a
privilege to express our gratitude and respect to all those who guided us in the completion of
the project.

We are thankful to our internal guide Mr. V. NARESH, Asst. Professor in the Department
of Computer Science Engineering, St. Mary’s Engineering College, for having been a
source of encouragement and for instilling the vigor to do this project work.

We are obliged to Dr. B. SRISHAILAM, Head of the Department of CSE (Data Science),
St. Mary’s Engineering College, for his guidance and suggestions throughout the
project work.

We take this opportunity to express a deep sense of gratitude to Dr. T.N. SRINIVAS
RAO, Principal of St. Mary’s Engineering College, for allowing us to do this project and
for his affectionate encouragement in presenting this project work.

We convey our sincere thanks to Sri Rev. K.V.K RAO, Chairman of St. Mary’s
Engineering College, for giving us a learning environment in which to grow personally
as well as professionally.

We would like to express our thanks to all staff members who have helped us directly and
indirectly in accomplishing this project work. We also extend our sincere thanks to our
parents and friends for their moral support throughout the project work. Above all, we thank
God Almighty for his manifold mercies in carrying out this project work successfully.

K.ARUNACHALAM 21BH5A6708

CS.SRIVATSAV 21BH5A6702

T.VENKATESH 20BH1A6749

B.GIRI 20BH1A6709
DECLARATION

This is to certify that the work reported in this thesis titled “SMS SPAMING DETECTION
USING NLP TECHNIQUES”, submitted to the Department of CSE (Data Science), St.
Mary’s Engineering College in fulfilment of the requirements for the award of the degree of
Bachelor of Technology, is a bonafide work done by us in the academic year 2023-24 under
the guidance of V.NARESH. No part of the thesis is copied from books, journals or the
internet, and wherever a portion is taken, the same has been duly referenced in the text.
The reported results are based on the project work done entirely by us and are not copied
from any other source. We also declare that the matter embodied in this thesis has not
previously been submitted by us, in full or in part, for the award of any degree of any
other institution or university.

Date

K.ARUNACHALAM 21BH5A6708

CS.SRIVATSAV 21BH5A6702

T.VENKATESH 20BH1A6749

B.GIRI 20BH1A6709
ABSTRACT
In today’s digital world, mobile SMS (Short Message Service) communication has become a
part of almost every person’s life. Meanwhile, mobile users suffer from the harassment of
spam SMS, which constitutes a veritable nuisance to subscribers. Hackers and spammers
exploit SMS to intrude into mobile devices by sending unsolicited messages, which leaves
these devices vulnerable; in some cases an attacker can gain remote access to the device.
We propose a novel approach that analyzes message content and extracts features using the
TF-IDF (term frequency-inverse document frequency) technique to efficiently distinguish
spam messages from ham messages using different machine learning classifiers. The
classifiers used in the proposed work are evaluated with metrics such as accuracy,
precision and recall. In our proposed approach, a voting classifier is used to increase
the accuracy rate.
INDEX

1. INTRODUCTION

2. L
LIST OF FIGURES

S.NO   FIGURE NO   FIGURE NAME
1      7.1.1       ANDROID ARCHITECTURE
2      7.2.1       SELECT ANDROID AND MAINTAIN THE INSTALLATION PATH OF THE ANDROID SDK
3      7.2.2       SELECT ANDROID SDK AND AVD MANAGER
4      7.2.3       SELECT ANDROID SDK AND AVD MANAGER
5      7.2.4       INSTALLING THE ADT PLUGIN FOR ECLIPSE
6      7.2.5       PACKAGES AVAILABLE FOR DOWNLOAD
7      7.2.6       CHOOSE PACKAGES TO INSTALL
8      7.2.7       INSTALLING ARCHIVES
9      7.3.1       SOURCE CODE
10     7.3.2       SELECT NEW VIRTUAL DEVICE
11     7.3.3       CREATE NEW ANDROID VIRTUAL DEVICE
12     7.3.4       TEST DEVICE
13     8.1.1       DATA FLOW DIAGRAM
14     9.2         CLASS DIAGRAM
15     9.3         SEQUENCE DIAGRAM
16     9.4         SEQUENCE DIAGRAM
17     9.5         ACTIVITY DIAGRAM
LIST OF PLATES

S.NO   PLATE NO   PLATE NAME
1      12.1       Home Page
2      12.2       User Registration Page
3      12.3       Admin Login
4      12.4       User Activation
5      12.5
6      12.6
7      12.7
8      12.8
9      12.9
10     12.10
CHAPTER-1
1. INTRODUCTION
In the digital world, mobile devices are used for many purposes in daily life, such as
business, communication, and data sharing. In the communication context, mobile
devices can use email, SMS (Short Message Service), and online chat apps to share
information that may be personal or professional. Of these, SMS is the most widely used.
Many companies do business by sending SMS in bulk to targeted customers about their
services, offers, and promotions. SMS messages are short because of the character
limit, and they are delivered between mobile devices through the operator's network.
A user can type such messages by hand. Another type is the automated SMS, in which a
program sends the message according to its purpose. For these services, many
third-party services or APIs can send SMS in bulk to users in one click.
An SMS can be spam or ham. Spam SMS is an unwanted or undesired text message with
content related to prizes, promotions, advertisements, and complimentary services.
Spammers aim to steal confidential information such as usernames, passwords, and
financial data. Through spam SMS, phishers can mount a phishing attack in which they
send malicious links and invite the user to visit those links in order to steal
sensitive information from the user's mobile device. A spam message may also carry
spyware through which the spammer can steal data or damage the system. The solution to
these problems is the accurate and timely identification of each SMS as spam or ham, so
that users can manage incoming messages and take action on spam messages. Using feature
selection, each SMS is classified as spam or ham. The spam detection described here is a
content-based approach that analyzes the text content of the message.
CHAPTER-2
2. LITERATURE

2.1 ANALYSIS AND EVALUATION OF PRIVACY PROTECTION
BEHAVIOR AND INFORMATION DISCLOSURE CONCERNS IN
ONLINE SOCIAL NETWORKS
REFERENCE: Mohammadi, A. and Hamidi
Online Social Networks (OSNs) have become the largest infrastructure for social
interactions such as building relationships, sharing personal experiences and service
delivery. Nowadays social networks have been widely welcomed by people. Most
research on managing privacy protection within social network sites (SNS)
regards users as the owners of their information. However, individuals cannot control
their privacy alone; it is controlled by groups. The use of OSNs raises concerns about
the privacy of online personal data. According to a number of studies, many efforts
have been made so far to protect the confidentiality and security of data on social
networks, but an understanding of the concept of privacy protection remains
essential for people. The purpose of this article is to analyze the tools and algorithms
that address concerns about privacy protection and data security in social
networks among adults, adolescents and children. These statistical tools and algorithms
analyze the collected data. The results of this literature review showed that most
of the articles on this topic were published in 2014. Furthermore, the survey
was the most common method of collecting information in these studies.
2.2 PHISHING DETECTION: ANALYSIS OF VISUAL
SIMILARITY-BASED APPROACHES
REFERENCE : Jain, A.K. and Gupta, B.B
Phishing is one of the major problems faced by the cyber-world and leads to financial
losses for both industries and individuals. Detection of phishing attacks with high
accuracy has always been a challenging issue. At present, visual similarity based
techniques are very useful for detecting phishing websites efficiently. A phishing
website looks very similar in appearance to its corresponding legitimate website in
order to deceive users into believing that they are browsing the correct website.
Visual similarity based phishing detection techniques utilise a feature set such as text
content, text format, HTML tags, Cascading Style Sheets (CSS), images, and so forth, to
make the decision.
These approaches compare the suspicious website with the corresponding legitimate
website by using various features and if the similarity is greater than the predefined
threshold value then it is declared phishing. This paper presents a comprehensive
analysis of phishing attacks, their exploitation, some of the recent visual similarity
based approaches for phishing detection, and its comparative study. Our survey
provides a better understanding of the problem, current solution space, and scope of
future research to deal with phishing attacks efficiently using visual similarity based
approaches.

2.3 FIGHTING AGAINST PHISHING ATTACKS: STATE OF THE
ART AND FUTURE CHALLENGES
REFERENCE: Gupta, B. Tewari , A. Jain, A.K. and Agrawal, D.P
In the last few years, phishing scams have grown rapidly, posing a huge threat to global
Internet security. Today, the phishing attack is one of the most common and serious
threats over the Internet, where cyber attackers try to steal a user’s personal or
financial credentials by using either malware or social engineering. Detection of
phishing attacks with high
accuracy has always been an issue of great interest. Recent developments in phishing
detection techniques have led to various new techniques, specially designed for
phishing detection where accuracy is extremely important. Phishing problem is widely
present as there are several ways to carry out such an attack, which implies that one
solution is not adequate to address it. Two main issues are addressed in our paper. First,
we discuss in detail phishing attacks, history of phishing attacks and motivation of
attacker behind performing this attack. In addition, we also provide taxonomy of
various types of phishing attacks. Second, we provide taxonomy of various solutions
proposed in the literature to detect and defend from phishing attacks. In addition, we
also discuss various issues and challenges faced in dealing with phishing attacks and
spear phishing and how phishing is now targeting the emerging domain of IoT. We
discuss various tools and datasets that are used by the researchers for the evaluation of
their approaches. This provides better understanding of the problem, current solution
space and future research scope to efficiently deal with such attacks.
2.4 FEATURE SELECTION AND CLASSIFICATION APPROACH
FOR
REFERENCE : G. Tripathi, S. Naganna, G. Noida, and G. Noida
Feature selection has been the focus of interest for quite some time and much work has
been done. With the creation of huge databases and the consequent requirements for
good machine learning techniques, new problems arise and novel approaches to feature
selection are in demand. This survey is a comprehensive overview of many existing
methods from the 1970's to the present. It identifies four steps of a typical feature
selection method, and categorizes the different existing methods in terms of generation
procedures and evaluation functions, and reveals unattempted combinations of
generation procedures and evaluation functions. Representative methods are chosen
from each category for detailed explanation and discussion via example. Benchmark
datasets with different characteristics are used for comparative study. The strengths and
weaknesses of different methods are explained. Guidelines for applying feature
selection methods are given based on data types and domain characteristics. This
survey identifies the future research areas in feature selection, introduces newcomers to
this field, and paves the way for practitioners who search for suitable methods for
solving domain-specific real-world applications.

2.5 SMS SPAM MESSAGE DETECTION USING TERM
FREQUENCY-INVERSE DOCUMENT FREQUENCY AND
RANDOM FOREST ALGORITHM
REFERENCE : Nilam Nur Amir Sharif, N F Mohd Azmi, Suriayati
Chuprat
The daily traffic of Short Message Service (SMS) keeps increasing. As a result, it leads
to a dramatic increase in mobile attacks, such as spammers who plague the service with
spam messages sent to groups of recipients. Mobile spam is a growing problem,
as the number of spam messages keeps increasing day by day even with filtering systems.
Spams are defined as unsolicited bulk messages in various forms such as unwanted
advertisements, credit opportunities or fake lottery winner notifications. Spam
classification has become more challenging due to complexities of the messages
imposed by spammers. Hence, various methods have been developed in order to filter
spams. In this study, methods of term frequency-inverse document frequency (TF-IDF)
and Random Forest Algorithm will be applied on SMS spam message data collection.
Based on the experiment, Random Forest algorithm outperforms other algorithms with
an accuracy of 97.50%.
CHAPTER-3
3. SYSTEM ANALYSIS
3.1 EXISTING SYSTEM:
The existing system addresses the problem of SMS spam detection and thread
identification using a state-of-the-art clustering-based algorithm. It has two stages.
In the first stage, binary classification techniques such as NB, SVM, LDA and NMF are
used to categorize each SMS as spam or ham. In the second stage, clusters are created
for the ham SMS using non-negative matrix factorization (NMF) and K-means clustering
techniques. SMS spam detection and thread identification are used in many SMS
activities, such as SMS folder classification, SMS classification and SMS thread
summarization. SMS thread identification thus uses two levels: the first is
classification and the second is clustering. An SMS thread consists of SMS messages, so
it can recognize the previous communication in a message. The NMF clustering technique
performs better than the K-means clustering technique in terms of the number of SMS
messages participating in the threads identified.
3.1.1 DISADVANTAGES:
Filtering spam messages through SMS classification is becoming more challenging due
to the complexity of the messages crafted by spammers. Methods such as term
frequency-inverse document frequency (TF-IDF) with the Random Forest algorithm have been
applied to the data and compared on accuracy. However, accuracy alone cannot determine
the performance of an algorithm; precision, recall and F-measure must also be
observed. The performance of an algorithm also varies with the features used in the
data set.
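The point about accuracy can be made concrete with a small sketch. The function name `classification_metrics` and the confusion counts below are hypothetical, chosen only for illustration; spam is treated as the positive class.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Accuracy, precision, recall and F1 from confusion counts (spam = positive)."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical worst case: a classifier that labels every message as ham
# on a set with 747 spam and 4825 ham messages.
always_ham = classification_metrics(tp=0, fp=0, fn=747, tn=4825)
print(always_ham)
```

Under these assumed counts the accuracy is about 0.866 even though recall is 0, that is, not a single spam message is caught. This is why precision, recall and F-measure are reported alongside accuracy.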
3.2 PROPOSED SYSTEM:
This section describes the general structure of the work process of the experiment. In
this study, a machine learning tool is used for the analysis and classification of the
dataset. At the first level, data is gathered from various sources to build a good
dataset of ham and spam messages in text format, which serves as the input for the
model. At the second level, we convert the data set from text format to CSV (Comma
Separated Values). Pre-processing is then carried out to improve the quality of the
input, either by removing unneeded words or by stemming them. The pre-processed data is
then transformed into a machine-readable, non-contextual form by converting it to
vectors or by discretization. The labeled data is loaded and its attributes are
recorded. The attributes used for the analysis in this dataset are text and class.
After that, a classifier is applied to the dataset; that is, the classifier is trained
on the training data. Testing is performed on the testing data to obtain the final
results. In the last step of the experiment, the confusion matrix is obtained and the
results of the applied classifier are examined.
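The text-to-CSV conversion step might look like the following minimal sketch. It assumes the raw file stores one message per line as `label<TAB>message`; the helper name `raw_to_csv` and the sample messages are made up for illustration.

```python
import csv
import io


def raw_to_csv(raw_text: str) -> str:
    """Convert 'label<TAB>message' lines into CSV text with a class,text header."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["class", "text"])
    for line in raw_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        label, _, message = line.partition("\t")
        writer.writerow([label.strip(), message.strip()])
    return out.getvalue()


# Made-up sample in the assumed tab-separated format.
sample = "ham\tOk, see you at 5\nspam\tWIN a prize now! Call 12345\n"
csv_text = raw_to_csv(sample)
rows = list(csv.reader(io.StringIO(csv_text)))
print(rows)
```

The `csv` module quotes any message that contains a comma, so the two attributes (class and text) survive the round trip intact.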

3.2.1 ADVANTAGES:
We use the SMS Spam dataset, which is publicly available in the UCI repository. It
consists of 5572 text messages, of which 747 are labeled spam and 4825 are labeled
ham. Once the dataset has been gathered, we apply a sequence of steps to it. First, we
perform exploratory data analysis (EDA) on the dataset. Then we carry out text
preprocessing to clean the message text, for example by removing special symbols and
converting the text to lower case. Next, before applying the classifiers, we convert
the cleaned text into numerical values, using the TF-IDF technique to extract the
features. After feature extraction we apply different individual and ensemble
classifiers, such as Random Forest, Bernoulli Naïve Bayes, Support Vector Machine,
Bagging Classifier, and Extra Random Tree, and then apply a voting classifier that
combines their predictions by vote for the spam detection.
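A minimal sketch of this pipeline with scikit-learn is given below. It is an illustration under assumptions rather than the project's actual code: the messages are made up, the hyperparameters are arbitrary, "Extra Random Tree" is assumed to correspond to `ExtraTreesClassifier`, and hard (majority) voting is assumed.

```python
from sklearn.ensemble import (BaggingClassifier, ExtraTreesClassifier,
                              RandomForestClassifier, VotingClassifier)
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Made-up toy messages; the real experiment would load the UCI SMS data instead.
texts = [
    "win a free prize now", "claim your cash reward today",
    "free entry call now to win", "urgent you have won a lottery",
    "are we meeting for lunch", "see you at home tonight",
    "can you call me later", "thanks for the help yesterday",
]
labels = ["spam", "spam", "spam", "spam", "ham", "ham", "ham", "ham"]

# Hard voting: each classifier casts one vote and the majority label wins.
voter = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("bnb", BernoulliNB()),
        ("svm", SVC(kernel="linear")),
        ("bag", BaggingClassifier(n_estimators=10, random_state=0)),
        ("et", ExtraTreesClassifier(n_estimators=50, random_state=0)),
    ],
    voting="hard",
)

# TF-IDF turns each message into a weighted term vector before classification.
model = make_pipeline(TfidfVectorizer(lowercase=True), voter)
model.fit(texts, labels)
predictions = model.predict(["free prize call now", "see you at lunch"])
print(predictions)
```

Wrapping the vectorizer and the voter in one pipeline means the same TF-IDF vocabulary learned during training is reused at prediction time.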
CHAPTER-4
4. FEASIBILITY STUDY
The feasibility of the project is analyzed in this phase, and a business proposal is put
forth with a very general plan for the project and some cost estimates. During system
analysis, the feasibility study of the proposed system is carried out. This is to
ensure that the proposed system is not a burden to the company. For feasibility
analysis, some understanding of the major requirements for the system is essential.
THREE KEY CONSIDERATIONS INVOLVED IN THE
FEASIBILITY ANALYSIS ARE:
• ECONOMICAL FEASIBILITY
• TECHNICAL FEASIBILITY
• SOCIAL FEASIBILITY
4.1 ECONOMICAL FEASIBILITY
This study is carried out to check the economic impact that the system will
have on the organization. The amount of funds that the company can pour into the
research and development of the system is limited. The expenditures must be
justified. The developed system is well within the budget, and this was achieved
because most of the technologies used are freely available. Only the customized
products had to be purchased.
4.2 TECHNICAL FEASIBILITY
This study is carried out to check the technical feasibility, that is, the
technical requirements of the system. Any system developed must not place a high
demand on the available technical resources, as this would lead to high demands being
placed on the client. The developed system must have modest requirements, as only
minimal or no changes are required for implementing this system.
4.3 SOCIAL FEASIBILITY
This aspect of the study is to check the level of acceptance of the system by the
user. This includes the process of training the user to use the system efficiently. The
user must not feel threatened by the system, but must instead accept it as a necessity.
The level of acceptance by the users depends solely on the methods that are employed to
educate the users about the system and to make them familiar with it. Their level of
confidence must be raised so that they are also able to offer constructive criticism,
which is welcomed, as they are the final users of the system.
CHAPTER-5
5. SYSTEM REQUIREMENTS
5.1 HARDWARE REQUIREMENTS:
• System : Intel Core i3 processor
• Hard Disk : 250 GB
• RAM : 4 GB
• Monitor : 15” LED
• Input devices : keyboard, mouse
5.2 SOFTWARE REQUIREMENTS:
• Operating system : Windows 10, 64-bit / 32-bit
• Coding Language : Python
• Front-End : HTML, CSS, JavaScript
• Database : MySQL
CHAPTER-6
6. SYSTEM DESIGN

6.1 SYSTEM ARCHITECTURE:

[Figure: system architecture. SMS domain-specific text data (example: “Change to win the
prize! Invest 1000 and get 100000”) flows to the BERT text pre-processor (1.
tokenization, 2. adding special tokens, 3. padding and truncation, 4. segment
embedding, 5. masking). The processed SMS text flows to the pre-trained BERT (base)
model, which produces a contextual sentence embedding for the SMS text. The embedding
flows to the classifier (ML models: SVM, RF, XGBoost; DL models: LSTM, BiLSTM), which
also receives the label (spam or ham) and outputs spam or ham.]

Our dataset contains the unprocessed text of the SMS domain, and to make it compatible
with the BERT model we must preprocess the text data. BERT offers its own
preprocessing package that transforms raw text into processed text that is suitable for
use with BERT. As depicted in the figure, the first step involves providing the raw SMS
text to the preprocessor, which then follows the steps outlined below.
1. Tokenization: The input text is broken down into individual words or subwords,
called tokens. BERT uses WordPiece tokenization, which means that it can split words
into smaller subwords as needed, allowing for more efficient use of the model's
vocabulary.
2. Adding special tokens: BERT requires special tokens to indicate the beginning and
end of a sentence ([CLS] and [SEP]), as well as a padding token ([PAD]) to mark where
the actual text ends and any padding begins. These special tokens are added to the
tokenized text.
3. Padding and truncation: BERT models require fixed-length inputs, so the text is
either padded with special tokens or truncated to a specific length.
4. Segment embedding: In order to enable the model to differentiate between
different sentences in a document, each token is assigned a segment ID indicating
which sentence it belongs to.
5. Masking: During pre-training, a random subset of the input tokens is masked, meaning
that they are replaced with a special [MASK] token. This encourages the model to learn
to predict missing words based on the surrounding context.
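Steps 1 to 4 above can be illustrated with a toy, pure-Python sketch. It is not BERT's actual preprocessor: a whitespace split stands in for WordPiece tokenization, and the function name `toy_bert_preprocess` and the `max_len` value are assumptions made for illustration. The attention mask shown here only marks which positions are real tokens and which are padding.

```python
from typing import Dict, List


def toy_bert_preprocess(text: str, max_len: int = 10) -> Dict[str, List]:
    """Toy version of BERT-style input preparation for a single sentence."""
    # 1. Tokenization (whitespace split as a stand-in for WordPiece).
    tokens = text.lower().split()
    # 3a. Truncation, leaving room for the two special tokens.
    tokens = tokens[: max_len - 2]
    # 2. Special tokens marking the start and end of the sequence.
    tokens = ["[CLS]"] + tokens + ["[SEP]"]
    # Attention mask: 1 for real tokens, 0 for padding.
    attention_mask = [1] * len(tokens)
    # 3b. Padding up to the fixed length.
    pad = max_len - len(tokens)
    tokens = tokens + ["[PAD]"] * pad
    attention_mask = attention_mask + [0] * pad
    # 4. Segment IDs: a single-sentence input uses segment 0 throughout.
    segment_ids = [0] * max_len
    return {"tokens": tokens, "attention_mask": attention_mask,
            "segment_ids": segment_ids}


out = toy_bert_preprocess("Invest 1000 and get 100000", max_len=8)
print(out["tokens"])
# -> ['[CLS]', 'invest', '1000', 'and', 'get', '100000', '[SEP]', '[PAD]']
```

A real pipeline would map each token to a vocabulary ID afterwards; that lookup is omitted here because it depends on the specific pre-trained model.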
Next, once the preprocessor has produced data in a format that can be input to the
BERT model, we fine-tune the BERT model. Fine-tuning involves training the model
on the downstream task using the prepared data. BERT itself is pre-trained with two
objectives, namely the masked language model (MLM) and next sentence prediction
(NSP). With the MLM technique, a word in a sentence is hidden, and BERT is required to
predict the masked word by taking into account the surrounding words in both
directions. In contrast, NSP ensures that BERT learns the connection between two
sentences by asking it to predict whether the second sentence follows the first.
Notably, BERT is trained using both MLM and NSP simultaneously; for NSP, 50% of the
sentence pairs are true consecutive sentences and 50% are random pairs. Having been
trained with the MLM and NSP objectives, BERT generates a contextual sentence embedding
for the entire SMS text that served as input.
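The MLM idea can be sketched in a few lines. This is a simplified illustration, not BERT's exact procedure: the function name `mask_tokens` is hypothetical, each eligible token is simply replaced by [MASK] with probability `mask_prob`, and the finer replacement rules used in actual BERT pre-training are left out.

```python
import random
from typing import List, Tuple

SPECIAL_TOKENS = ("[CLS]", "[SEP]", "[PAD]")


def mask_tokens(tokens: List[str], mask_prob: float = 0.15,
                seed: int = 0) -> Tuple[List[str], List[int]]:
    """Randomly replace ordinary tokens with [MASK]; return new tokens and positions."""
    rng = random.Random(seed)  # seeded so the sketch is repeatable
    masked = list(tokens)
    positions = []
    for i, tok in enumerate(tokens):
        if tok in SPECIAL_TOKENS:
            continue  # special tokens are never masked
        if rng.random() < mask_prob:
            masked[i] = "[MASK]"
            positions.append(i)  # the model must predict tokens[i] here
    return masked, positions


tokens = ["[CLS]", "win", "a", "prize", "now", "[SEP]", "[PAD]"]
masked, positions = mask_tokens(tokens, mask_prob=0.5, seed=1)
print(masked, positions)
```

During pre-training the original tokens at `positions` act as the prediction targets, which is what forces the model to use context from both sides of the gap.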
6.2 DATA FLOW DIAGRAM:
1. The DFD is also called a bubble chart. It is a simple graphical formalism that can be
used to represent a system in terms of the input data to the system, the various
processing carried out on this data, and the output data generated by the system.
2. The data flow diagram (DFD) is one of the most important modeling tools. It is used to
model the system components. These components are the system process, the data used
by the process, an external entity that interacts with the system and the information
flows in the system.
3. A DFD shows how the information moves through the system and how it is modified by a
series of transformations. It is a graphical technique that depicts information flow and
the transformations that are applied as data moves from input to output.
4. A DFD may be used to represent a system at any level of abstraction. A DFD may be
partitioned into levels that represent increasing information flow and functional detail.
6.3 UML DIAGRAMS
UML stands for Unified Modeling Language. UML is a standardized general-
purpose modeling language in the field of object-oriented software engineering. The
standard is managed, and was created by, the Object Management Group.
The goal is for UML to become a common language for creating models of object-
oriented computer software. In its current form UML comprises two major
components: a meta-model and a notation. In the future, some form of method or
process may also be added to, or associated with, UML.
The Unified Modeling Language is a standard language for specifying,
visualizing, constructing and documenting the artifacts of a software system, as well
as for business modeling and other non-software systems.
The UML represents a collection of best engineering practices that have proven
successful in the modeling of large and complex systems.
The UML is a very important part of developing object-oriented software and the
software development process. The UML uses mostly graphical notations to express
the design of software projects.

GOALS:
The primary goals in the design of the UML are as follows:
1. Provide users with a ready-to-use, expressive visual modeling language so that they
can develop and exchange meaningful models.
2. Provide extensibility and specialization mechanisms to extend the core concepts.
3. Be independent of particular programming languages and development processes.
4. Provide a formal basis for understanding the modeling language.
5. Encourage the growth of the OO tools market.
6. Support higher-level development concepts such as collaborations, frameworks,
patterns and components.
7. Integrate best practices.

6.4 USE CASE DIAGRAM:


A use case diagram in the Unified Modeling Language (UML) is a type of behavioral
diagram defined by and created from a Use-case analysis. Its purpose is to present a
graphical overview of the functionality provided by a system in terms of actors, their
goals (represented as use cases), and any dependencies between those use cases. The
main purpose of a use case diagram is to show what system functions are performed for
which actor. Roles of the actors in the system can be depicted.
Use case:
6.5 SEQUENCE DIAGRAM:
A sequence diagram in Unified Modeling Language (UML) is a kind of interaction
diagram that shows how processes operate with one another and in what order. It is a
construct of a Message Sequence Chart. Sequence diagrams are sometimes called event
diagrams, event scenarios, and timing diagrams.
6.6 CLASS DIAGRAM:
In software engineering, a class diagram in the Unified Modeling Language (UML) is a
type of static structure diagram that describes the structure of a system by showing the
system's classes, their attributes, operations (or methods), and the relationships among
the classes. It explains which class contains information.

6.7 ACTIVITY DIAGRAM:


Activity diagrams are graphical representations of workflows of stepwise activities and
actions with support for choice, iteration and concurrency. In the Unified Modeling
Language, activity diagrams can be used to describe the business and operational step-
by-step workflows of components in a system. An activity diagram shows the overall
flow of control.
Collaboration:

Component:
Deployment:
CHAPTER-7
7. IMPLEMENTATION
7.1 DATA COLLECTION:
The quality of the dataset is of great value when performing any experiment in data
mining. The first step is to choose a dataset that contains the text of each SMS
together with its category. The dataset used here contains the text of each SMS and the
category of that SMS, ham or spam.
7.2 PRE-PROCESSING:
Pre-processing of data is performed to improve the training model’s learning process.
SMS text contains stop words, punctuation, and a mix of upper and lower case words that
can hamper the learning of the training model. Pre-processing is applied after
collecting the dataset with an equal number of SMS. First, each SMS is tokenized,
because an SMS is a string of words that is difficult for the model to learn from
directly; each SMS is split into words so that pre-processing can be applied. Next,
stop words are removed, as they carry no weight. Afterward, stemming is performed,
because SMS words often appear in inflected or abbreviated forms; stemming reduces the
tokenized words to a common root form (it does not correct spelling). Furthermore,
numeric values are removed, because digits are assumed to contribute little to
identifying ham or spam SMS and are therefore ignored. Finally, punctuation is removed,
after which the proposed model can be trained on the cleaned text.
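The pre-processing order described above can be sketched in pure Python as follows. The stop-word set and the suffix rules are deliberately tiny, hypothetical stand-ins (a real system would use a full stop-word list and a proper stemmer such as Porter's), and the names `crude_stem` and `preprocess` are made up for illustration.

```python
import re

# Tiny illustrative stop-word list (assumption, not a standard list).
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "and", "you", "your",
              "for", "of", "in"}
# Naive suffix stripping as a stand-in for a real stemming algorithm.
SUFFIXES = ("ing", "ed", "s")


def crude_stem(word: str) -> str:
    """Strip one common suffix if at least three characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(sms: str) -> list:
    # Tokenize into lower-case words, digit runs and single punctuation marks.
    tokens = re.findall(r"[a-z]+|\d+|[^\w\s]", sms.lower())
    tokens = [t for t in tokens if t not in STOP_WORDS]   # remove stop words
    tokens = [crude_stem(t) for t in tokens]              # stemming
    tokens = [t for t in tokens if not t.isdigit()]       # remove numeric values
    tokens = [t for t in tokens if t.isalnum()]           # remove punctuation
    return tokens


print(preprocess("You are WINNING 1000 prizes!"))
# -> ['winn', 'prize']
```

The output shows why stemming should not be mistaken for spell correction: "winning" becomes the root-like form "winn", which is not a dictionary word but still groups related forms together.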
7.3 ML ALGORITHM:
The attributes used for the analysis in this dataset are text and class. A classifier
is applied to the dataset; that is, the classifier is trained on the training data.
Testing is performed on the testing data to obtain the final results. In the last step
of the experiment, the confusion matrix is obtained and the results of the applied
classifier are examined and discussed. The following classifiers are considered:
KNN Classification
Decision Tree Algorithm
Naive Bayes
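The train, test and confusion-matrix steps for these three classifiers could be sketched with scikit-learn as below. The messages are made up, the split ratio and `n_neighbors` value are arbitrary, and `MultinomialNB` is assumed as the Naive Bayes variant; none of this is taken from the project's actual code.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Made-up toy data with the two attributes used in the text: text and class.
texts = [
    "win a free prize now", "claim your cash reward today",
    "free entry call now to win", "urgent you have won a lottery",
    "cheap loans apply now", "congratulations free ticket for you",
    "exclusive offer call today", "winner claim your free gift",
    "are we meeting for lunch", "see you at home tonight",
    "can you call me later", "thanks for the help yesterday",
    "running late be there soon", "happy birthday my friend",
    "did you finish the report", "let us talk tomorrow morning",
]
labels = ["spam"] * 8 + ["ham"] * 8

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0)

# Fit TF-IDF on the training text only, then reuse it for the test text.
vectorizer = TfidfVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

classifiers = {
    "KNN": KNeighborsClassifier(n_neighbors=3),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Naive Bayes": MultinomialNB(),
}

results = {}
for name, clf in classifiers.items():
    clf.fit(X_train_vec, y_train)
    y_pred = clf.predict(X_test_vec)
    # Rows are true classes, columns are predicted classes, in the order given.
    results[name] = confusion_matrix(y_test, y_pred, labels=["ham", "spam"])
    print(name, results[name].tolist())
```

Fixing the label order in `confusion_matrix` keeps the matrices comparable across classifiers, so the spam row and column always sit in the same position.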
CHAPTER-8
8. SYSTEM TEST
The purpose of testing is to discover errors. Testing is the process of trying to discover
every conceivable fault or weakness in a work product. It provides a way to check the
functionality of components, subassemblies, assemblies and/or a finished product. It is
the process of exercising software with the intent of ensuring that the Software system
meets its requirements and user expectations and does not fail in an unacceptable
manner. There are various types of test. Each test type addresses a specific testing
requirement.
8.1 TYPES OF TESTS:
8.1.1 UNIT TESTING
Unit testing involves the design of test cases that validate that the
internal program logic is functioning properly and that program inputs produce valid
outputs. All decision branches and internal code flow should be validated. It is the
testing of individual software units of the application. It is done after the
completion of an individual unit and before integration. This is structural testing
that relies on knowledge of the unit's construction and is invasive. Unit tests perform
basic tests at the component level and test a specific business process, application,
and/or system configuration. Unit tests ensure that each unique path of a business
process performs accurately to the documented specifications and contains clearly
defined inputs and expected results.
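As an illustration of a unit test with clearly defined inputs and expected results, the sketch below uses Python's built-in `unittest` module. The unit under test, `normalize_sms`, is a hypothetical helper invented for this example and is not part of the project's documented code.

```python
import unittest


def normalize_sms(text: str) -> str:
    """Hypothetical unit under test: lower-case the text and collapse whitespace."""
    return " ".join(text.lower().split())


class NormalizeSmsTest(unittest.TestCase):
    def test_lowercases_text(self):
        self.assertEqual(normalize_sms("FREE Prize"), "free prize")

    def test_collapses_whitespace(self):
        self.assertEqual(normalize_sms("call   now \n today"), "call now today")

    def test_empty_input(self):
        self.assertEqual(normalize_sms(""), "")


# Run the test case programmatically and keep the result for inspection.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(NormalizeSmsTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
print("all passed:", result.wasSuccessful())
```

Each test method exercises one path through the unit with a fixed input and a fixed expected output, which mirrors the requirement stated above.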
8.1.2 INTEGRATION TESTING
Integration tests are designed to test integrated software
components to determine if they actually run as one program. Testing is event driven
and is more concerned with the basic outcome of screens or fields. Integration tests
demonstrate that although the components were individually satisfactory, as shown by
successful unit testing, the combination of components is correct and consistent.
Integration testing is specifically aimed at exposing the problems that arise from the
combination of components.
8.1.3 FUNCTIONAL TEST
Functional tests provide systematic demonstrations that functions tested are
available as specified by the business and technical requirements, system
documentation, and user manuals.
Functional testing is centered on the following items:
Valid Input : identified classes of valid input must be accepted.
Invalid Input : identified classes of invalid input must be rejected.
Functions : identified functions must be exercised.
Output : identified classes of application outputs must be exercised.
Systems/Procedures : interfacing systems or procedures must be invoked.
Organization and preparation of functional tests is focused on requirements, key
functions, or special test cases. In addition, systematic coverage of identified business
process flows, data fields, predefined processes, and successive processes must be
considered for testing. Before functional testing is complete, additional tests are
identified and the effective value of current tests is
determined.
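The valid-input and invalid-input items listed above can be illustrated with a small functional check. Everything in this sketch is an assumption made for the example: the function validate_sms, the MAX_LENGTH limit and the sample messages are hypothetical and are not taken from the project.

```python
MAX_LENGTH = 1000  # hypothetical upper bound, chosen only for this example


def validate_sms(message):
    """Return the stripped message if it is acceptable input, else raise ValueError."""
    if not isinstance(message, str):
        raise ValueError("message must be text")
    stripped = message.strip()
    if not stripped:
        raise ValueError("message must not be empty")
    if len(stripped) > MAX_LENGTH:
        raise ValueError("message is too long")
    return stripped


def is_accepted(message):
    """True if validate_sms accepts the message, False if it rejects it."""
    try:
        validate_sms(message)
        return True
    except ValueError:
        return False


# One list per identified input class.
valid_inputs = ["Are we still meeting at 5?", "  FREE entry, text WIN to claim  "]
invalid_inputs = ["", "   ", None, 12345, "x" * (MAX_LENGTH + 1)]

assert all(is_accepted(m) for m in valid_inputs)      # valid classes are accepted
assert not any(is_accepted(m) for m in invalid_inputs)  # invalid classes are rejected
print("valid input accepted, invalid input rejected")
```

Listing the input classes as data keeps the functional test easy to extend when additional cases are identified.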
8.1.4 SYSTEM TEST
System testing ensures that the entire integrated software system meets
requirements. It tests a configuration to ensure known and predictable results. An
example of system testing is the configuration-oriented system integration test. System
testing is based on process descriptions and flows, emphasizing pre-driven process
links and integration points.
8.1.5 WHITE BOX TESTING
White Box Testing is testing in which the software tester has knowledge of the inner
workings, structure and language of the software, or at least its purpose. It is used to
test areas that cannot be reached from a black box level.
8.1.6 BLACK BOX TESTING
Black Box Testing is testing the software without any knowledge of the inner
workings, structure or language of the module being tested. Black box tests, like most
other kinds of tests, must be written from a definitive source document, such as a
specification or requirements document. It is testing in which the software under test
is treated as a black box: you cannot “see” into it. The test provides inputs and
responds to outputs without considering how the software works.
8.1.7 ACCEPTANCE TESTING
User Acceptance Testing is a critical phase of any project and requires significant
participation by the end user. It also ensures that the system meets the functional
requirements.
Test Results: All the test cases mentioned above passed successfully. No defects were
encountered.
8.1.8 UNIT TESTING
Unit testing is usually conducted as part of a combined code and unit
test phase of the software lifecycle, although it is not uncommon for coding and unit
testing to be conducted as two distinct phases.
8.2 TEST STRATEGY AND APPROACH
Field testing will be performed manually and functional tests will be written in
detail.
8.2.1 TEST OBJECTIVES
• All field entries must work properly.
• Pages must be activated from the identified link.
• The entry screen, messages and responses must not be delayed.
8.2.2 FEATURES TO BE TESTED
• Verify that the entries are of the correct format.
• No duplicate entries should be allowed.
• All links should take the user to the correct page.
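For an SMS dataset, the format and duplicate checks listed above could be sketched as follows. This is one possible reading, not the project's implementation: the function load_unique_entries, the (label, text) row layout and the labels "ham" and "spam" are assumptions made for the example.

```python
def load_unique_entries(rows):
    """Keep well-formed, first-seen (label, text) rows and report the rest."""
    seen = set()
    kept, rejected = [], []
    for row in rows:
        well_formed = (
            isinstance(row, tuple) and len(row) == 2
            and row[0] in ("ham", "spam")
            and isinstance(row[1], str) and row[1].strip() != ""
        )
        if not well_formed:
            rejected.append((row, "bad format"))
            continue
        # Treat messages that differ only in case or outer spaces as duplicates.
        key = (row[0], row[1].strip().lower())
        if key in seen:
            rejected.append((row, "duplicate"))
            continue
        seen.add(key)
        kept.append(row)
    return kept, rejected


# Hypothetical rows, for illustration only.
rows = [
    ("ham", "See you at lunch"),
    ("spam", "WIN cash now"),
    ("spam", "win cash now "),   # duplicate of the previous row
    ("junk", "hello"),           # unknown label
    ("ham", ""),                 # empty message
]

kept, rejected = load_unique_entries(rows)
print(len(kept), "kept;", [reason for _, reason in rejected])
```

Returning the rejected rows together with a reason makes it possible to report exactly which entries failed which check.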
CHAPTER-9
9. INPUT AND OUTPUT DESIGN
9.1 INPUT DESIGN:
Input design is the link between the information system and the user. It comprises
developing the specifications and procedures for data preparation, that is, the steps
necessary to put transaction data into a usable form for processing. This can be achieved
by instructing the computer to read data from a written or printed document, or by having
people key the data directly into the system. The design of input focuses on controlling
the amount of input required, controlling errors, avoiding delay, avoiding extra steps
and keeping the process simple. The input is designed so that it provides security and
ease of use while retaining privacy. Input design considers the following:
• What data should be given as input?
• How should the data be arranged or coded?
• The dialog to guide the operating personnel in providing input.
• Methods for preparing input validations and the steps to follow when errors occur.
9.1.1 OBJECTIVES:
1. Input design is the process of converting a user-oriented description of the input
into a computer-based system. This design is important to avoid errors in the data input
process and to show management the correct direction for getting correct information
from the computerized system.
2. It is achieved by creating user-friendly data entry screens that can handle large
volumes of data. The goal of designing input is to make data entry easier and free from
errors. The data entry screen is designed so that all data manipulations can be
performed. It also provides record viewing facilities.
3. When data is entered, it is checked for validity. Data can be entered with the help
of screens. Appropriate messages are provided as and when needed so that the user is
never left confused. Thus the objective of input design is to create an input layout
that is easy to follow.
9.2 OUTPUT DESIGN:
A quality output is one that meets the requirements of the end user and presents the
information clearly. In any system, the results of processing are communicated to the
users and to other systems through outputs. In output design it is determined how the
information is to be displayed for immediate need, and also as hard copy output. It is
the most important and direct source of information for the user. Efficient and
intelligent output design improves the system’s relationship with the user and helps
decision-making.
Designing computer output should proceed in an organized, well thought out manner. The
right output must be developed while ensuring that each output element is designed so
that people find the system easy and effective to use. When analysts design computer
output, they should:
1. Identify the specific output that is needed to meet the requirements.
2. Select methods for presenting information.
3. Create documents, reports, or other formats that contain information produced by the
system.
The output form of an information system should accomplish one or more of the
following objectives:
• Convey information about past activities, current status or projections of the future.
• Signal important events, opportunities, problems, or warnings.
• Trigger an action.
• Confirm an action.
CHAPTER-10
10. SCREENSHOT
Load:
Target:
View:
Graph:
Data:
RF:
CHAPTER-11
11. CONCLUSION
We tested our classification model on the prepared dataset and measured its SMS spam
detection performance. To evaluate the classifiers and make them comparable to current
approaches, we used accuracy as the measure of effectiveness. The experiment was
performed with several classifiers for SMS spam detection: decision tree, KNN and
Naïve Bayes. The Naïve Bayes classifier showed the highest accuracy among them. Future
work should explore ways to extend the feature set. Including more informative features,
such as thresholds on message length, and examining learning curves may improve the
results. In future, these techniques could also be used in a mobile application to
protect phones from spam messages.
CHAPTER-12
12. REFERENCE
1. Mohammadi, A. and Hamidi, H., "ANALYSIS AND EVALUATION OF PRIVACY
PROTECTION BEHAVIOR AND INFORMATION DISCLOSURE CONCERNS IN
ONLINE SOCIAL NETWORKS", International Journal of Engineering,
Transactions B: Applications, Vol. 31, No. 8, (2018), 1234-1239.
2. Jain, A.K. and Gupta, B.B., "PHISHING DETECTION: ANALYSIS OF VISUAL
SIMILARITY-BASED APPROACHES", Security and Communication Networks, Vol.
2017, No., (2017).
3. Gupta, B.B., Tewari, A., Jain, A.K. and Agrawal, D.P., "FIGHTING AGAINST
PHISHING ATTACKS: STATE OF THE ART AND FUTURE CHALLENGES",
Neural Computing and Applications, Vol. 28, No. 12, (2017), 3629-3654.
4. G. Tripathi, S. Naganna, G. Noida, and G. Noida, “FEATURE SELECTION AND
CLASSIFICATION APPROACH FOR,” Machine Learning and Applications: An
International Journal, vol. 2, no. 2, pp. 1–16, 2015.
5. Nilam Nur Amir Sharif, N F Mohd Azmi, Suriayati Chuprat, "SMS SPAM
MESSAGE DETECTION USING TERM FREQUENCY-INVERSE DOCUMENT
FREQUENCY AND RANDOM FOREST ALGORITHM," in The Fifth Information
Systems International Conference 2019, Procedia Computer Science 161 (2019)
509-515, ScienceDirect.
6. Pavas Navaney, Gaurav Dubey, Ajay Rana, “SMS SPAM FILTERING USING
SUPERVISED MACHINE LEARNING ALGORITHMS,” in 8th International
Conference on Cloud Computing, Data Science & Engineering,
978-1-5386-1719-9/18/ 2018 IEEE.
7. Gotham Sai Sravya, G Pradeepini, Vaddeswaram, "MOBILE SMS SPAM FILTER
TECHNIQUES USING MACHINE LEARNING TECHNIQUES," International
Journal Of Scientific & Technology Research, Volume 9, Issue 03, March 2020.
8. R. P. Corresponding, “A STUDY ON ANALYSIS OF SMS CLASSIFICATION
USING DOCUMENT FREQUENCY THRESHOLD,” International Journal of
Information Engineering and Electronic Business, no. February, pp. 44–50, 2012.