Report Tech Gautami Final
BELAGAVI, KARNATAKA
                        A Seminar Report on
“Towards Fraudulent URL Classification with Large Language
                Model based on Deep Learning”
 Submitted in the partial fulfillment for the requirements for the
                    conferment of Degree of
                  BACHELOR OF ENGINEERING
                                   in
                                  By
             Gautami Rakesh                    USN:1BY20IS058
                              Dr. Savitha S.
                     Assistant Professor, BMSIT&M
2023-2024
                    VISVESVARAYA TECHNOLOGICAL UNIVERSITY
                                BELAGAVI, KARNATAKA
                  BMS INSTITUTE OF TECHNOLOGY & MANAGEMENT
                               YELAHANKA, BENGALURU-560064
CERTIFICATE
This is to certify that the Seminar (18ISS86) entitled “Towards Fraudulent URL Classification
with Large Language Model based on Deep Learning” is a bonafide work carried out by Gautami
Rakesh (1BY20IS058) in partial fulfillment for the award of Bachelor of Engineering Degree in
Information Science and Engineering of the Visvesvaraya Technological University, Belagavi
during the year 2023-2024. It is certified that all corrections/suggestions indicated for Internal
Assessment have been incorporated in this report. The seminar report has been approved as it
satisfies the academic requirements with respect to seminar work for the B.E. Degree.
__________________                                                  __________________
Signature of the Guide                                             Signature of the HOD
  Dr Savitha S.                                                        Dr. Pushpa S K
                                ____________________________
                                 Signature of the Coordinator
                                       Dr. Drakshaveni G
                          ACKNOWLEDGEMENT
I am happy to present this Technical Seminar after completing it successfully. This seminar
would not have been possible without the guidance, assistance and suggestions of many
individuals. I would like to express my deep sense of gratitude and indebtedness to each and
every one who has helped me make this seminar a success.
I heartily thank our Principal, Dr. Sanjay H A, BMS Institute of Technology &
Management for his constant encouragement and inspiration in taking up this seminar.
I heartily thank our Head of Department Dr. Pushpa S K, Dept. of Information Science
and Engineering, BMS Institute of Technology & Management for her constant
encouragement and inspiration in taking up this seminar.
Special thanks to all the staff members of Information Science Department for their help and
kind co-operation.
Lastly, I thank my parents and friends for the encouragement and support they gave me in
completing this work.
                                                                   By,
                                                                   Gautami Rakesh
Cloud Assisted Smart Learning System
Declaration
I hereby declare that the Technical Seminar titled “Towards Fraudulent URL Classification
with Large Language Model based on Deep Learning” is a record of original project work
undertaken for the award of the degree of Bachelor of Engineering in Information Science and
Engineering of the Visvesvaraya Technological University, Belagavi during the year
2023-2024. I have completed this project under the guidance of Dr. Savitha S.
I also declare that this project report has not been submitted for the award of any degree,
diploma, associateship, fellowship or other title anywhere else.
Student Photo
USN - 1BY20IS058
 Signature    -
                                         ABSTRACT
The rising tide of fraud, fueled by the easy access to personal data via web addresses,
presents a pressing concern that demands immediate attention. Traditional methods such as
machine learning algorithms or static lists have been commonly used to identify fraudulent
URLs. However, these approaches often prove to be time-consuming and yield suboptimal
results. In response to this challenge, this study proposes an innovative solution that
leverages language models for the detection of fraudulent websites.
Unlike conventional methods, which may struggle with interpretability and scalability, the
proposed approach emphasizes the importance of interpretability analysis. By employing
language models, this method aims to provide more nuanced insights into the characteristics
of fraudulent websites, thereby enhancing the overall effectiveness of fraud detection efforts.
Moreover, the ultimate goal of this study is to deploy the developed model onto a server,
offering a robust and scalable solution to combat the pervasive issue of online fraud.
Through the implementation of this novel approach, the study seeks to address the
shortcomings of existing fraud detection methods and provide a more efficient and effective
means of combating fraudulent activities on the internet. By focusing on interpretability and
scalability, the proposed model aims to not only improve the accuracy of fraud detection but
also facilitate its integration into existing online security infrastructures.
INDEX
ACKNOWLEDGEMENT
DECLARATION
ABSTRACT
LIST OF FIGURES
Chapter Title
        4.1         Design
           5        Implementation
           6        Future Scope
           7        Application
           8        Conclusion
           9        References
                              LIST OF FIGURES
 Figure No.                         Figure name
     1.1             Cloud Computing Architecture
     4.1             The communication steps between user and server
CHAPTER I
                              INTRODUCTION
The proliferation of internet fraud has become a pressing concern alongside the rapid
expansion and accessibility of the internet. Various forms of online scams, such as credit
fraud, romance scams, and phishing, pose significant threats to personal privacy, social
stability, and economic well-being. Detecting fraudulent websites has thus become an urgent
priority, yet traditional methods reliant on manual rule-making or feature engineering
struggle to keep pace with the constantly evolving online landscape.
To address these challenges, this study proposes a novel approach leveraging pre-trained
language models for fraudulent URL classification. By training on a comprehensive dataset
and utilizing the vast knowledge encoded within language models, this method aims to
automatically learn richer feature representations without the need for manual intervention.
Compared with conventional deep learning models such as long short-term memory (LSTM)
networks, the language model-based approach exhibits enhanced adaptability to evolving
fraudulent websites while reducing labor costs.
1.2 MOTIVATION
      • Rising Threat of Internet Fraud: With the exponential growth of internet usage,
        various forms of online fraud have emerged, including phishing, credit card scams,
        and identity theft. The increasing prevalence of these fraudulent activities poses a
        significant threat to individuals' privacy, financial security, and overall trust in online
        transactions.
      • Inadequacy of Traditional Methods: Traditional approaches to detecting fraudulent
        websites often rely on manual rule-making or feature engineering, which are
        time-consuming and struggle to keep pace with the rapidly evolving tactics employed by
        fraudsters. These methods are often ineffective in accurately identifying new types of
        fraudulent websites and may result in high false-positive rates.
      • Promise of Deep Learning: Deep learning techniques, particularly convolutional
        neural networks (CNNs) and recurrent neural networks (RNNs), have shown promise
        in various classification tasks, including fraud detection. However, existing deep
        learning models may face challenges in capturing the complex linguistic and
        structural patterns present in fraudulent URLs, limiting their effectiveness.
      • Opportunity Presented by Language Models: Pre-trained language models, such as
        BERT and GPT, offer a unique opportunity to leverage large-scale linguistic
        knowledge for fraudulent URL classification. By harnessing the contextual
        understanding and feature extraction capabilities of these language models, it is
        possible to develop a more robust and accurate detection system capable of adapting
        to the evolving landscape of internet fraud.
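The manual feature engineering mentioned above can be illustrated with a minimal sketch. The particular features and their names are assumptions chosen for illustration; the report does not list the features used by traditional detectors.

```python
import re
from urllib.parse import urlparse


def handcrafted_features(url: str) -> dict:
    """Compute a few manually designed lexical features for one URL.

    The feature set is hypothetical; it only illustrates the kind of
    hand-written signals that a rule-based or classical ML detector uses.
    """
    parsed = urlparse(url)
    host = parsed.hostname or ""
    return {
        "length": len(url),                           # long URLs can hide the real host
        "num_dots": host.count("."),                  # many subdomains can be suspicious
        "num_digits": sum(ch.isdigit() for ch in url),
        "has_ip_host": bool(re.fullmatch(r"\d{1,3}(\.\d{1,3}){3}", host)),
        "has_at_symbol": "@" in url,                  # text before '@' is not the host
        "uses_https": parsed.scheme == "https",
    }


features = handcrafted_features("http://192.168.0.1/login")
```

Every feature here has to be invented and maintained by a person, which is the labor cost that the language model-based approach aims to reduce.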
1.3 OBJECTIVE
   The objective of this study is to develop a robust and efficient method for detecting
   fraudulent websites by leveraging pre-trained language models. Specifically, the
   objectives include:
      • Enhanced Fraud Detection: To improve the accuracy and reliability of fraudulent
        website classification by harnessing the power of language models to automatically
        extract rich feature representations from URL data.
      • Adaptability to Changing Fraud Tactics: To develop a detection system that can
        effectively adapt to the constantly evolving tactics used by fraudsters by leveraging
        the contextual understanding and generalization capabilities of language models.
      • Reduced Manual Intervention: To minimize the need for manual rule-making or
        feature engineering by leveraging the automatic feature extraction capabilities of
        language models, thereby reducing labor costs and improving efficiency.
      • Experimental Validation: To empirically evaluate the proposed method's
        performance against traditional rule-based approaches and conventional deep learning
        models using a comprehensive dataset, and to assess whether it improves on them in
        terms of classification accuracy, robustness, and adaptability.
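The experimental validation objective relies on standard classification metrics. The following is a minimal sketch of how they could be computed for a binary fraudulent/legitimate classifier; the function name and the label encoding (1 = fraudulent, 0 = legitimate) are assumptions, not taken from the report.

```python
def classification_metrics(y_true, y_pred):
    """Return accuracy, precision, recall and F1 for binary labels.

    Assumes 1 marks a fraudulent URL and 0 a legitimate one.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # fraud caught
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # legitimate passed
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarm
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # fraud missed

    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


metrics = classification_metrics([1, 1, 1, 0, 0, 0, 0, 1],
                                 [1, 1, 0, 0, 0, 1, 0, 1])
```

Reporting precision and recall alongside accuracy matters here because fraud datasets are often imbalanced, so accuracy alone can hide a high rate of missed fraudulent URLs.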
CHAPTER II
                           LITERATURE SURVEY
After a thorough search and evaluation of the available literature, the works most relevant
to this project were selected. The literature that supports this system is reviewed below.
The prevalence of phishing and spear phishing attacks, highlighted by Verizon's reports,
necessitates effective countermeasures. This study compares deep learning frameworks like
Keras and Fast.ai with traditional machine learning algorithms in detecting and classifying
malicious URLs using the ISCX-URL-2016 dataset. Notable contributions include insights
into the effectiveness of various algorithms and the impact of obfuscation techniques on
detection accuracy. Prior research reveals a shift towards machine learning-based solutions
due to limitations of blacklisting methods. However, there's a lack of standardized datasets
and a gap in comparing deep learning with traditional methods. Addressing these gaps, the
study aims to guide practical implementations in industry settings by considering metrics like
training and prediction times across different architectures.
The paper explores the detection of malicious URLs using machine learning techniques,
presenting a model based on random forest, SVM, DNN, and CNN. It addresses the
increasing cyber-crime threat and the challenges of efficiently extracting malicious features
from URLs. Experimentation with various algorithms and feature extraction methods
demonstrates effective malicious URL detection. The study emphasizes the limitations of
heuristic and blacklisting methods and proposes machine learning as a more adaptable
solution. Results indicate the efficacy of the proposed model, with high accuracy and reduced
false negatives. Overall, the research provides insights into combating cyber fraud and
enhancing online security through advanced detection methods.
Fraudulent URL and Credit Card Transaction Detection System Using Machine Learning
S Geetha; Yusuf Mohammed Khan; Rohan Sujay; Sai Pavan Yoganand; Rohan B
The abstract highlights the importance of cybersecurity in combating malicious activities like
fraudulent transactions and malicious URLs. It introduces the use of machine learning
algorithms to develop APIs capable of detecting security flaws, specifically focusing on
identifying malicious URLs and fraudulent credit card transactions. The paper aims to
provide users with a web-based solution for security detection, eliminating the need for local
software installation. Previous models have shown promising accuracy rates ranging from
70% to 90%.
An Effective Approach to Classify Fraud SMS Using Hybrid Machine Learning Models
Nidhi Agrawal; Abhishek Bajpai; Kumkum Dubey; BDK Patro
The abstract introduces a model focused on detecting fraud messages, particularly SMS
messages and URLs, due to their increasing proliferation. The model consists of two stages:
SMS message classification using a hybrid model, and URL examination. The hybrid model
incorporates Naive Bayes Classifier, Random Forest, and Extra Tree Classifier, achieving
high accuracy and precision rates individually. Overall, the hybrid model outperforms other
machine learning approaches, demonstrating an accuracy of 96.86% and a precision of
99.366% in dataset analysis. The introduction highlights the prevalence of SMS phishing
(smishing) attacks and their detrimental effects, emphasizing the need for effective detection
mechanisms in combating fraudsters' activities.
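One common way to combine several classifiers into a hybrid model is majority voting. The summary above does not state how the cited work combines its Naive Bayes, Random Forest and Extra Tree classifiers, so the voting rule below is an assumption, and the three member functions are hypothetical keyword rules standing in for trained models.

```python
from collections import Counter
from typing import Callable, Sequence

Classifier = Callable[[str], str]  # maps a message to "fraud" or "ham"


def majority_vote(classifiers: Sequence[Classifier], message: str) -> str:
    """Return the label predicted by most of the member classifiers."""
    votes = Counter(clf(message) for clf in classifiers)
    return votes.most_common(1)[0][0]


# Hypothetical stand-ins for trained models (not real learned classifiers).
def keyword_model(msg: str) -> str:
    return "fraud" if "prize" in msg.lower() else "ham"


def link_model(msg: str) -> str:
    return "fraud" if "http" in msg.lower() else "ham"


def length_model(msg: str) -> str:
    return "fraud" if len(msg) > 120 else "ham"


members = [keyword_model, link_model, length_model]
label = majority_vote(members, "You won a prize! Claim at http://x.example")
```

With an odd number of members and two labels a tie cannot occur, which is one reason three-member ensembles are a convenient choice for a sketch like this.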
Fraud Detection Using Optimized Machine Learning Tools Under Imbalance Classes,
Mary Isangediok, Kelum Gajamannage
The abstract introduces a detection scheme for malicious advertising attacks on the web,
focusing on characteristics and proposing a strategy based on URL depth. The scheme
utilizes Nutch and Modsecurity, enhancing Nutch to identify suspicious URLs and improving
Modsecurity for response filtering. Comparison with the 360 secure browser demonstrates
the scheme's effectiveness in detecting malvertising cases. The paper addresses the growing
concern of online advertising as a vector for cyber attacks and proposes a comprehensive
detection approach.
The existing systems for detecting fraudulent websites primarily rely on traditional methods
such as rule-based systems or feature engineering, which involve manually defining rules or
designing features to identify fraudulent URLs. These methods often require significant
human effort and struggle to keep up with the constantly evolving tactics employed by
fraudsters.
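The limitation of static lists can be seen in a minimal sketch. The blacklist entries below are hypothetical example domains, and exact host matching is an assumed (and deliberately simple) lookup rule.

```python
from urllib.parse import urlparse

# Hypothetical blacklist; real lists are far larger and updated continuously.
BLACKLIST = {"login-secure-bank.example", "free-prizes.example"}


def is_blacklisted(url: str, blacklist=BLACKLIST) -> bool:
    """Return True if the URL's host exactly matches a blacklisted domain."""
    host = (urlparse(url).hostname or "").lower()
    return host in blacklist


known = is_blacklisted("http://free-prizes.example/win")     # listed host
variant = is_blacklisted("http://free-prizes2.example/win")  # slightly altered host
```

A fraudster only needs to register a slightly different domain to evade the exact match, which is why such lists require constant manual upkeep.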
In recent years, deep learning-based approaches have gained attention for fraudulent URL
classification. Convolutional neural networks (CNNs) and recurrent neural networks (RNNs),
such as long short-term memory (LSTM) networks, have been utilized to extract features
from URLs and classify them as fraudulent or legitimate. However, these approaches may
face limitations in capturing the intricate linguistic and structural patterns present in
fraudulent URLs, leading to suboptimal performance.
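Before a CNN or LSTM can process a URL, the string is typically converted into a fixed-length sequence of integer indices. The sketch below shows one possible character-level encoding; the alphabet, the reserved ids and the default maximum length are assumptions, since the report does not specify the input encoding.

```python
import string
from typing import List

# Assumed alphabet: lowercase letters, digits and common URL punctuation.
ALPHABET = string.ascii_lowercase + string.digits + "-._~:/?#[]@!$&'()*+,;=%"
PAD_ID, UNK_ID = 0, 1  # reserved ids for padding and unknown characters
CHAR_TO_ID = {ch: i + 2 for i, ch in enumerate(ALPHABET)}


def encode_url(url: str, max_len: int = 32) -> List[int]:
    """Map a URL to a fixed-length list of character ids.

    Longer URLs are truncated and shorter ones are right-padded with PAD_ID,
    so every example has the same shape when fed to a CNN or LSTM.
    """
    ids = [CHAR_TO_ID.get(ch, UNK_ID) for ch in url.lower()[:max_len]]
    return ids + [PAD_ID] * (max_len - len(ids))


encoded = encode_url("http://a.example")
```

An embedding layer would then map each id to a dense vector; that step depends on the chosen deep learning framework and is omitted here.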
Some studies have proposed modifications to traditional deep learning models, such as using
multi-layer CNNs or attention mechanisms in RNNs, to improve their performance in
detecting fraudulent websites. However, these methods may still struggle to effectively adapt
to new and unseen types of fraudulent URLs.
Overall, while existing systems have made strides in detecting fraudulent websites, there
remains room for improvement in terms of accuracy, robustness, and adaptability to changing
fraud tactics.
The problem statement revolves around the inefficiency of existing methods in accurately
and efficiently detecting fraudulent websites due to their reliance on manual rule-making or
feature engineering. Traditional approaches struggle to keep pace with the evolving tactics
used by fraudsters, leading to suboptimal performance. Moreover, conventional deep
learning models may fail to capture the complex linguistic and structural patterns present in
fraudulent URLs. As a result, there is a pressing need to develop a more robust and adaptive
detection system leveraging advanced techniques such as pre-trained language models.
       • The proposed system employs pre-trained language models like BERT and GPT for
         detecting fraudulent websites.
       • These language models automatically extract rich feature representations from URL
         data, enhancing classification accuracy.
       • By leveraging linguistic knowledge encoded within the models, the system improves
         adaptability to evolving fraud tactics.
       • Reduced reliance on manual intervention enhances efficiency and effectiveness in
         detecting fraudulent activities online.
       • Experimental validation demonstrates superiority over traditional methods and
         conventional deep learning models in accuracy, robustness, and adaptability.
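To give an intuition for how a BERT-style model reads a URL, the sketch below applies greedy longest-match subword splitting in the style of WordPiece. The vocabulary is a tiny hypothetical one and the pre-splitting rule is an assumption; a real pipeline would use the tokenizer shipped with the chosen pre-trained model.

```python
import re
from typing import List, Set

# Hypothetical toy vocabulary; "##" marks a piece that continues a word.
TOY_VOCAB = {"pay", "##pal", "login", "secure", "com", "-", ".", "/", ":"}


def wordpiece(word: str, vocab: Set[str]) -> List[str]:
    """Split one chunk into subwords by greedy longest-match-first."""
    pieces, start = [], 0
    while start < len(word):
        end, match = len(word), None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return ["[UNK]"]  # no piece fits: the whole chunk is unknown
        pieces.append(match)
        start = end
    return pieces


def tokenize_url(url: str, vocab: Set[str] = TOY_VOCAB) -> List[str]:
    """Split a URL into alphanumeric runs and punctuation, then into subwords."""
    chunks = re.findall(r"[a-z0-9]+|[^a-z0-9]", url.lower())
    tokens: List[str] = []
    for chunk in chunks:
        tokens.extend(wordpiece(chunk, vocab))
    return tokens


tokens = tokenize_url("paypal-login.com")
```

Subword splitting lets the model reuse pieces such as "pay" and "login" that it has seen in other contexts, which is one plausible source of the adaptability described in the bullets above.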
CHAPTER III
      • Security: Implement robust security measures to protect sensitive user data and
        ensure secure communication channels to prevent unauthorized access or tampering.
      • Performance: Ensure that the system processes URLs efficiently, with response
        times under a predefined threshold, to facilitate timely detection of fraudulent
        activities.
      • Usability: Provide an intuitive and user-friendly interface for administrators and
        end-users, with clear navigation and informative feedback, to enhance user experience
        and adoption.
      • Programming Language and Frameworks: Utilize Python along with deep learning
        frameworks like TensorFlow or PyTorch for model development and training.
      • Pre-Trained Language Models: Incorporate pre-trained language models such as
        BERT or GPT for feature extraction from URLs.
      • Database Management System: Implement a DBMS like PostgreSQL or MongoDB
        for storing URL data and classification results.
      • Integrated Development Environment (IDE) and Version Control: Use an IDE like
        PyCharm along with version control systems like Git for code development,
        debugging, and collaboration.
      • Deployment Platforms and Monitoring Tools: Determine deployment platforms (e.g.,
        AWS, Azure) and integrate monitoring and logging tools for tracking system
        performance and troubleshooting.
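The database requirement above could look roughly as follows. This is a minimal sketch that uses Python's built-in SQLite module as a stand-in for PostgreSQL or MongoDB; the table name, columns and helper functions are hypothetical.

```python
import sqlite3


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a store for URL classification results."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS url_results ("
        " url TEXT PRIMARY KEY,"
        " label TEXT NOT NULL,"   # e.g. 'fraudulent' or 'legitimate'
        " score REAL NOT NULL)"   # model confidence for the label
    )
    return conn


def save_result(conn: sqlite3.Connection, url: str, label: str, score: float) -> None:
    """Insert a result, replacing any earlier result for the same URL."""
    conn.execute(
        "INSERT OR REPLACE INTO url_results (url, label, score) VALUES (?, ?, ?)",
        (url, label, score),
    )
    conn.commit()


def lookup(conn: sqlite3.Connection, url: str):
    """Return (label, score) for a stored URL, or None if it is unknown."""
    return conn.execute(
        "SELECT label, score FROM url_results WHERE url = ?", (url,)
    ).fetchone()


conn = open_store()
save_result(conn, "http://a.example", "fraudulent", 0.93)
```

Caching results by URL means a repeated request can be answered without running the model again, which helps with the response-time requirement listed above.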
Hardware Requirements:
CHAPTER IV
                      SYSTEM ARCHITECTURE
4.1 DESIGN
CHAPTER V
IMPLEMENTATION
The model consists of five essential layers: (1) the infrastructure layer, (2) the
platform layer, (3) the services layer, (4) the client-access layer and (5) the user layer.
The first layer is the hardware layer. It includes all the hardware, computing and storage
capacity for the higher-level layers. The infrastructure layer contains the resources and
architecture that support the infrastructure, such as virtual machines and the cloud platform.
It shares the IT infrastructure resources and connects the large system pool together to
provide services. Cloud computing enables the hardware and infrastructure layers to work
like an internet/intranet, so that the data resources can be accessed in a secure and scalable
way.
The second layer is the platform layer, a software resource layer that consists of middleware
and the operating system. Different software resources are integrated through middleware
to provide a unified interface with which software developers can develop applications
and embed them in the cloud.
The third layer is the service layer, namely SaaS. In SaaS, the cloud computing service is
provided to customers. Web services, multimedia applications and business applications are
examples of the provided services. The client-access layer is the fourth layer of the proposed
architecture. It consists of multi-channel access from multiple devices and addresses the
issue of access to the cloud e-learning services available on the architecture, such as the
types of access devices and presentation models.
The Upper Sub-Layer: Security is an important issue in the cloud system because its
services are accessed over the internet. Each client can select its own security methods,
such as the required encryption process, and the cloud system has to agree on all of these
methods with the local server in order to interpret them. In addition, the users of the
educational system are at several levels, so the requests for services are diverse. The access
method is maintained by identifying the services and user types.
The policy between the user and the provider is defined by this sub-layer and depends on
multiple factors, such as the user level, the latency and the throughput. Based on the policy,
different priorities are set by the government for the users. For example, higher-priority
users can access the resources with lower latency. The policy also guarantees that the
provider can run the software smoothly, with maximum throughput and the best load balance.
Moreover, an authentication and credit verification sub-layer is required in this layer to
verify the local server as soon as a request for resources arrives from the server end. It also
authenticates and verifies the ELECCM system user's credit information for the requested
service; if the user has a sufficient balance for the requested service, it accepts the request
and forwards it to the lower sub-layer. As soon as the lower sub-layer confirms the request,
it adjusts the user account by deducting the amount for the requested service. The
Government sets the rules through bodies named the planning committee and the monitoring
committee. For example, the planning committee decides the prices for different types of
services based on analysis and agreement with the cloud partners. It also decides the amount
of funds to be allocated to an individual organization.
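The credit verification flow described above (check the balance, forward the request, deduct only after the lower sub-layer confirms) can be sketched as follows. The class, function and parameter names are hypothetical, and the lower sub-layer is represented by a simple callback.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Account:
    user: str
    balance: float


def handle_request(account: Account, price: float,
                   lower_layer_confirms: Callable[[], bool] = lambda: True) -> bool:
    """Process one service request in the upper sub-layer.

    The request is rejected if the balance is insufficient. Otherwise it is
    forwarded to the lower sub-layer, and the price is deducted only after
    that layer confirms the request.
    """
    if account.balance < price:
        return False                  # insufficient credit: reject
    if not lower_layer_confirms():
        return False                  # lower sub-layer did not confirm
    account.balance -= price          # adjust the account after confirmation
    return True


acc = Account(user="student-1", balance=10.0)
accepted = handle_request(acc, price=4.0)
```

Deducting only after confirmation avoids charging a user for a service that the lower sub-layer was unable to provide.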
The corruption monitoring committee monitors the daily proceedings of every institute and
all objections that come from the users' end (e.g. unmatched software).
The Lower Sub-Layer: The lower layer of the cloud architecture provides access to the
private resources that the user requests. The lower layer waits for the positive
acknowledgement sent from the upper layer. Once it receives the positive acknowledgement,
it provides the services requested by the user. The interaction between vendors and clients
is established under the responsibility of an instrumental panel in this layer. The layer also
has an operational panel that performs tasks such as monitoring the circumstances, handling
the PCs and managing the images.
CHAPTER VI
                                FUTURE SCOPE
The following fields may be included in the future scope of this study
   
CHAPTER VII
                                    APPLICATIONS
       Applications of the Fraudulent Website Detection System:
              • Social Media Networks: Social media platforms can employ the system to
                detect and remove fake accounts, fraudulent content, and malicious links,
                safeguarding users from scams and misinformation.
              • Cybersecurity Companies: Security firms can integrate the system into their
                platforms to provide enhanced protection against web-based threats, including
                malware distribution and data breaches.
CHAPTER VIII
                                   CONCLUSION
 In conclusion, the Fraudulent Website Detection System offers a robust solution to address
 the growing threat of online fraud. By leveraging advanced techniques in natural language
 processing and deep learning, the system can accurately identify fraudulent websites across
 various domains, including finance, e-commerce, social media, and more. Its applications
 span across industries, providing invaluable protection to financial institutions, online
 platforms, government agencies, healthcare providers, and individual users alike.
 With its capability to detect phishing scams, counterfeit products, fake accounts, and other
 fraudulent activities in real-time, the system ensures a secure online environment for users
 and helps mitigate financial losses and reputational damage for businesses. Moreover, its
 scalability, adaptability, and potential for future enhancements position it as a crucial tool in
 the ongoing fight against online fraud.