Project Document
BACHELOR OF TECHNOLOGY
IN
COMPUTER SCIENCE AND ENGINEERING
By
Sri. M. ANAND
Assistant Professor
Department of C.S.E.
2023 – 2024
Department of Computer Science and Engineering
G. PULLA REDDY ENGINEERING COLLEGE (Autonomous): KURNOOL
(Affiliated to JNTUA, ANANTAPURAMU)
CERTIFICATE
This is to certify that the Project Work entitled ‘Phishing
Website Detection Using Machine Learning’ is a bonafide record
of work carried out by
BACHELOR OF TECHNOLOGY
IN
COMPUTER SCIENCE AND ENGINEERING
…………………………….. ……………………………..
Sri. M. Anand Dr. N. Kasiviswanath
Assistant Professor, Professor & Head of the Department,
Department of C.S.E., Department of C.S.E.,
G. Pulla Reddy Engineering College, G. Pulla Reddy Engineering College,
Kurnool. Kurnool.
DECLARATION
B. Hemanth
(209X1A05J7)
ACKNOWLEDGEMENT
Finally, we wish to thank all our friends and well-wishers who have helped
us directly or indirectly during the course of this project work.
ABSTRACT
CONTENTS

1. INTRODUCTION
1.1 Introduction
1.3.1 Limitations
1.6 Objectives
1.7 Methodology
Attack Detection
4.1 Introduction
4.2.2 Operational Feasibility
4.2.3 Economic Feasibility
4.3 Dataset
4.4.4 ML Packages
4.4.5 ML Libraries
6.1 Introduction
6.2 Types of Classifiers
6.6 Deployment
6.7 Prediction
7.3 Screenshots
8.1 Conclusion
9. REFERENCES

LIST OF FIGURES

LIST OF TABLES
1. INTRODUCTION
1.1 INTRODUCTION
Phishing is a fraudulent technique that uses social and technological tricks to steal
customer identification and financial credentials. In our daily lives, we carry out most of
our work on digital platforms. Using a computer and the internet in many areas facilitates
our business and private lives, allowing us to complete transactions and operations
quickly in areas such as trade, health, education, communication, banking, aviation, research,
engineering, entertainment, and public services. With the development of mobile and wireless
technologies, users who once needed access to a local network can now easily connect to the
Internet anywhere and anytime. Although this situation provides great convenience, it has
revealed serious deficits in terms of information security. Thus, the need for users in
cyberspace to take measures against possible cyber-attacks has emerged.
These attacks mainly target the following areas: fraud, forgery, coercion, extortion,
hacking, service blocking, malware applications, illegal digital content and social
engineering. According to Kaspersky's data, the average cost of an attack in 2019
(depending on the size of the organization) is between $108K and $1.4 million. In addition,
the money spent on global security products and services is around $124 billion. Among
these attacks, the most widespread and also the most critical are "phishing attacks", which
cause both pecuniary loss and intangible damage.
U.S. businesses lose an estimated US$2 billion per year because their clients fall
victim to phishing. The 3rd Microsoft Computing Safer Index Report, released in February
2014, estimated that the annual worldwide impact of phishing could be as high as $5 billion.
Phishing attacks succeed because of a lack of user awareness. Since a phishing attack
exploits the weaknesses found in users, it is very difficult to mitigate, which makes it all
the more important to enhance phishing detection techniques. The conventional method of
detecting phishing websites is to add blacklisted URLs and Internet Protocol (IP) addresses
to an antivirus database, also known as the "blacklist" method. To evade blacklists,
attackers use creative techniques to fool users by making the URL appear legitimate through
obfuscation and many other simple techniques, including fast-flux, in which proxies are
automatically generated to host the web page, and algorithmic generation of new URLs.
Spam: The website attempts to flood the user with advertising or content such as fake
surveys and online dating sites.
Malware: The website is created by attackers to disrupt computer operation, gather sensitive
information, or gain access to private computer systems.
There is a significant chance of exploitation of user information. For these reasons,
dealing with phishing in modern society is highly urgent, challenging, and overly critical.
The methods for reaching target users in phishing attacks have continuously evolved over the
last decade: algorithm-based in the 1990s, e-mail-based in the early 2000s, then domain
spoofing, and in recent years delivery over HTTPS. Because of the size of the populations
attacked in recent years, the cost and effect of the attacks on users have been high.

The average financial cost of a data breach caused by a phishing attack in 2019 was
$3.86 million, and the approximate cost of BEC (Business Email Compromise) fraud is
estimated to be around $12 billion. It is also known that about 15% of the people who are
attacked are targeted at least once more. From this, it can be concluded that phishing
attacks will continue to be carried out in the coming years. We therefore propose a system
that uses machine learning techniques and algorithms such as Logistic Regression, KNN,
SVC, Random Forest, Decision Tree, XGB Classifier and Naïve Bayes to predict phishing
websites, based on parameters extracted from the website link entered by the user in the
front end.
1.2 PROBLEM STATEMENT
1. Cyber-attacks are growing at a faster rate than usual, and it has become evident that
necessary steps should be taken to bring them under control. Among the various
cyber-attacks, phishing websites are one of the most popular and commonly used attacks
to steal users' personal and financial information by manipulating website URLs and
IP addresses.
2. The main focus of this project is to implement a better model for detecting these
phishing websites using ML algorithms.
1.3 EXISTING SYSTEM
“Phishing Detection Using Machine Learning”: This paper proposes a phishing detection
system that detects blacklisted URLs, also known as phishing websites, so that an
individual can be alerted while browsing or accessing a particular website. It can
therefore be utilized for identification and authentication, and become a legitimate tool
to prevent an individual from being tricked. The system offers many features in
comparison with other software: it captures blacklisted URLs directly from the browser
to verify the validity of the website, notifies the user of blacklisted websites via a
popup while they are trying to access them, and also notifies them through email. This
system will help users stay alert when they try to access a blacklisted website.
1.3.1 LIMITATIONS
i. Because a list of blacklisted URLs is used for detection, the accuracy of the
predicted results may be very low, and there is no standard list of URLs published
by any standards organization.
ii. In daily life we may encounter many new URLs that seem identical to the original
ones but are not present in the list; such URLs are hard to differentiate.
1.4.1 ADVANTAGES
a. In today’s society, as phishing attacks have become evident, countermeasures are
necessary.
b. The problem of phishing has risen sharply in the last 3 years due to Covid-19: the
number of cases encountered daily increased by 125% in 2021 compared to 2020. If this
trend continues, phishing cases may rise to 500 per day by 2025.
c. Phishing detection techniques suffer from low detection accuracy and high false-alarm
rates, especially when novel phishing approaches are introduced.
d. Besides, the most commonly used technique, the blacklist-based method, is inefficient
in responding to emerging phishing attacks; since registering a new domain has become
easy, no blacklist can ensure a perfectly up-to-date database.
e. Therefore, we are developing a website to predict phishing URLs, which helps society
cut down on these attacks and protects users from fraud.
1.6 OBJECTIVES
a. To develop a novel approach to detect malicious URL and alert users.
b. To apply ML techniques in the proposed approach in order to analyze real-time
URLs and produce effective results.
1.7 METHODOLOGY
a. Take a dataset and divide it into training and testing sets in different ratios
(a sketch of this split follows the list below).
b. Implement a refined approach to detect phishing attacks using various machine learning
algorithms.
c. The algorithm that gives the comparatively better accuracy rate, combined with the
lexical features, is taken as our final prediction model.
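As a minimal sketch of step (a), assuming the dataset has already been loaded into a pandas DataFrame whose label column is named "class" (the file and column names here are assumptions, not the project's exact code):

import pandas as pd
from sklearn.model_selection import train_test_split

data = pd.read_csv("phishing_dataset.csv")   # assumed file name
X = data.drop(columns=["class"])             # the 30 extracted features
y = data["class"]                            # assumed label column

# An 80/20 split; other ratios (70/30, 75/25) can be tried as step (a) suggests.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
print(X_train.shape, X_test.shape)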
Chapter 4
System Analysis: This chapter describes the existing system and the proposed system,
along with their advantages and disadvantages.
Chapter 5
System Design: This chapter describes the project modules and details the Activity
diagram, Use Case diagram, Data Flow diagram, and Sequence diagram.
Chapter 6
Implementation: This chapter describes the detailed concepts of the project, the
algorithms with their detailed steps, and the code used to implement them.
Chapter 7
Testing: This chapter describes:
a. Methods of testing
Information about Unit testing, Validation testing, Functional testing,
Integration testing, and User Acceptance testing.
b. Test Cases
A detailed description of the program test cases.
Chapter 8
Conclusion and Future Enhancement: This chapter gives a brief summary of the project
and the enhancements planned for the future.
2. LITERATURE SURVEY
In this paper, the authors offer an intelligent system for detecting phishing websites.
The system acts as additional functionality for an internet browser, as an extension that
automatically notifies the user when it detects a phishing website. The system is based on a
machine learning method, particularly supervised learning. The Random Forest technique was
selected due to its good performance in classification. The focus is to pursue a
higher-performance classifier by studying the features of phishing websites and choosing the
best combination of them to train the classifier. The paper concludes with good accuracy
using a combination of 26 features.
Remarks: Low-ranging values reduce accuracy and increase execution time.
In this paper, the authors proposed a machine learning-based phishing detection system
that uses eight different algorithms to analyze the URLs, and three different datasets to
compare the results with other works. The experimental results show that the proposed models
have outstanding performance with a high success rate. The proposed systems were tested with
some recent datasets from the literature, and the results obtained were compared with the
newest works in the literature. The comparison shows that the proposed systems enhance the
efficiency of phishing detection and reach very good accuracy rates. As future work, the
authors first aim to create a new, large dataset for URL-based phishing detection systems,
and then to enhance the system using hybrid algorithms and deep learning models.
Remarks: Implementing many algorithms with various datasets may provide better results
compared to previous works, but the lexical features of a URL, which provide optimal results
with better accuracy rates, are not taken into consideration.
In this paper, the author discusses three approaches for detecting phishing websites.
The first analyzes various features of the URL; the second checks the legitimacy of the
website by determining where it is hosted and who manages it; and the third uses
visual-appearance-based analysis to check the genuineness of the website. The authors used
machine learning techniques and algorithms to evaluate these different features of URLs
and websites.
Remarks: A particular challenge in this domain is that criminals constantly devise new
strategies to counter our defensive measures. To succeed in this context, we need algorithms
that continually adapt to new examples and features of phishing URLs.
SYSTEM SPECIFICATIONS
Software requirements encompass the essential elements necessary for the development,
implementation, and functioning of a software system. These typically include the specification
of programming languages, frameworks, and libraries required for development, as well as the
need for specific databases or data storage solutions. Overall, software requirements serve as a
comprehensive guideline, outlining the technological, functional, and operational prerequisites
vital for the successful deployment and performance of a software application.
4. SYSTEM ANALYSIS
4.1 INTRODUCTION
System analysis is a thorough examination of the current software or the business
processes that the software aims to address. The goal is to understand the intricacies, identify
challenges, and define opportunities for improvement. Analysts gather information from
stakeholders, existing documentation, and through direct observations to comprehend the
software's functionalities, user requirements, and the broader context in which it operates.
During system analysis in a software project, the emphasis is on defining clear and
comprehensive requirements. This includes understanding user needs, business rules, and any
constraints that might impact the design and development of the software. Analysts aim to
bridge the gap between the current state of the software and what is needed for it to align
effectively with organizational goals and user expectations.
The insights gained from system analysis guide subsequent phases of the software
development life cycle. This includes system design, coding, testing, and implementation. The
thorough understanding of user requirements and system functionalities obtained during the
analysis phase helps in creating a software solution that is not only technically sound but also
addresses the practical needs and challenges faced by end-users and the organization. In
essence, system analysis in a software project serves as the cornerstone for making informed
decisions and laying the groundwork for successful software development and implementation.
Furthermore, in a software project, system analysis involves assessing the feasibility of
the proposed changes or developments. This includes evaluating the technical, operational, and
economic aspects to determine the practicality and viability of implementing the suggested
solutions. System analysis sets the stage for effective project planning, resource allocation, and
risk management, ensuring that the subsequent phases of the software development life cycle
proceed with a well-defined understanding of the system's requirements and the strategic
objectives of the project.
Scalability and flexibility are also critical considerations in technical feasibility. The
assessment revolves around determining whether the proposed solution can adapt to future
growth and changes in requirements without necessitating substantial modifications or
disruptions. A scalable and flexible technological infrastructure is essential for accommodating
the evolving needs of the organization.
4.2.2 OPERATIONAL FEASIBILITY
Operational feasibility is a crucial aspect of a feasibility study that assesses the
practicality and viability of implementing a proposed project within an organization's existing
operational environment. This evaluation focuses on whether the proposed solution can
seamlessly integrate into the daily business operations and whether it is acceptable and
adaptable for end-users. The primary goal is to ensure that the project aligns with the
organizational processes and can be effectively utilized by stakeholders without causing
disruptions.
Key considerations in operational feasibility include user acceptance, training
requirements, and potential changes in roles and responsibilities. Evaluating whether the
proposed project can be seamlessly integrated into the day-to-day activities of the organization
ensures that the implementation process is smooth and does not adversely affect productivity.
A positive operational feasibility assessment indicates that the proposed solution is not only
technically and economically viable but is also practical and operationally sound within the
context of the organization's current operational landscape.
4.2.3 ECONOMIC FEASIBILITY
Economic feasibility is a critical component of a feasibility study that evaluates the
financial viability and potential economic benefits of a proposed project. This assessment
involves a thorough analysis of the costs associated with project development and
implementation against the expected economic returns and benefits. The primary objective is
to determine whether the project is financially justifiable and aligns with the organization's
budgetary constraints.
In the economic feasibility analysis, various costs are considered, including development
costs, operational costs, maintenance costs, and any other expenditures associated with the
project's lifecycle. These costs are compared to the anticipated benefits, which may include
revenue generation, cost savings, increased efficiency, or other economic advantages. This
comparison helps stakeholders understand the financial implications of undertaking the project
and whether the expected benefits outweigh the incurred costs.
4.3 DATASET
The dataset used in this machine learning project was obtained from Kaggle, a well-known
platform for data science competitions and datasets. The supplied dataset contains 11,430
URLs, each with 30 extracted characteristics, and is intended to serve as a benchmark for
phishing detection systems that employ machine learning. The collection comprises precisely
45% genuine URLs and 55% phishing URLs. Each instance contains 30 features, and each feature
is associated with a rule: the feature takes one of three discrete values, 1 if the rule is
satisfied, 0 if the rule is partially satisfied, and -1 if the rule is not satisfied. A URL
whose rules are satisfied is termed phishing; otherwise it is termed legitimate. This dataset
is used for this research implementation since it is the most recent dataset accessible in
the public domain.
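As a quick sanity check on this encoding, the shape of the data and the three discrete feature values can be inspected after loading. This is an illustrative sketch; the file name is an assumption:

import pandas as pd

data = pd.read_csv("phishing_dataset.csv")   # assumed file name
print(data.shape)                            # expected (11430, 31): 30 features plus a label
print(data.iloc[:, 0].value_counts())        # each feature column should contain only 1, 0 and -1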
1. Having an IP Address
If an IP address is used in the URL instead of the domain name, such
as http://217.102.24.235/sample.html.
2. Length of URL
Phishers may conceal the suspicious element of the URL in the address bar by
using a lengthy URL.
3. URL Shortening Service
Provides access to a website with a lengthy URL. The URL http://sharif.hud.ac.uk/,
for example, may be abbreviated to bit.ly/1sSEGTB.
4. Using the @ symbol
The @ symbol in the URL causes the browser to disregard everything before it, and the
true address often follows the @ symbol.
5. Double Slash Redirection
The presence of “//” within the URL path indicates that the user will be redirected to another website.
6. Prefix Suffix
Phishers often add prefixes or suffixes separated by (-) to domain names in order to
give visitors the impression that they are dealing with a reputable website.
7. Using a Subdomain
Using a subdomain in the URL.
8. SSL Status
Indicates whether or not a website employs SSL.
9. Domain Registration Length
Because a phishing website typically exists for only a brief time, its domain tends to
be registered for a short period.
10. Favicon
A favicon is a visual image (icon) that is connected with a particular website. If the
favicon is loaded from a domain different than the one displayed in the address bar, the
site is most certainly a Phishing attempt.
11. Using Non-Standard Ports
It is much preferable to just open the ports that you need to regulate invasions.
Several firewalls, proxy servers, and Network Address Translation (NAT) servers
will, by default, block all or most of the ports and only allow access to those that are
explicitly allowed.
12. HTTPS token
Using a deceptive https token in the URL. For instance, http://www.mellat-phish.ir
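To illustrate how a few of these rules might be coded, the sketch below implements the IP-address, URL-length, “@”-symbol and prefix-suffix checks. It is a hedged example under assumptions: the 54/75-character thresholds and the function names are illustrative, not the project's exact implementation:

import re
from urllib.parse import urlparse

def having_ip_address(url):
    # Rule 1: an IP address in place of a domain name suggests phishing (1); else legitimate (-1)
    host = urlparse(url).netloc
    return 1 if re.fullmatch(r"(\d{1,3}\.){3}\d{1,3}(:\d+)?", host) else -1

def url_length(url):
    # Rule 2: long URLs can hide the suspicious part; 54 and 75 are commonly cited thresholds (assumed)
    if len(url) < 54:
        return -1
    return 0 if len(url) <= 75 else 1

def has_at_symbol(url):
    # Rule 4: browsers disregard everything before "@", so its presence is suspicious
    return 1 if "@" in url else -1

def prefix_suffix(url):
    # Rule 6: a "-" in the domain often imitates a reputable site
    return 1 if "-" in urlparse(url).netloc else -1

print(having_ip_address("http://217.102.24.235/sample.html"))   # prints 1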
Non-functional requirements are the requirements which are not directly concerned with
the specific function delivered by the system. They specify the criteria that can be used to judge
the operation of a system rather than specific behaviors. They may relate to emergent system
properties such as reliability, response time and store occupancy. Non-functional requirements
arise from user needs, budget constraints, organizational policies, the need
for interoperability with other software and hardware systems, or external factors
such as:
1. Product Requirements
2. Organizational Requirements
3. User Requirements
4. Basic Operational Requirements
Some of them are as follows:
• Reusability
The same code with limited changes can be used for detecting phishing attacks variants
like smishing, vishing, etc.
• Maintainability
The implementation is very basic and includes print statements, which make it easy to debug.
• Usability
The software used is very user friendly and open source. It also runs on any operating
system.
• Scalability
The implementation can include detection of vishing, smishing, etc.
a. For the desired performance, the transferred data size, connection speed, response time,
and processing speed must be considered.
b. The system should work in real time, meaning there should be an acceptable delay
between request and response.
c. The system should be reliable for the user.
4.4.4 ML PACKAGES
1. NumPy
NumPy is a general-purpose array-processing package. It provides a high-
performance multidimensional array object, and tools for working with these arrays. It
is the fundamental package for scientific computing with Python. It is a Python library
that provides a multidimensional array object, various derived objects (such as masked
arrays and matrices), and an assortment of routines for fast operations on arrays,
including mathematical, logical, shape manipulation, sorting, selecting, I/O, discrete
Fourier transforms, basic linear algebra, basic statistical operations, random simulation.
2. Pandas
Pandas is a Python package designed for fast and flexible data processing,
manipulation and analysis. Pandas has a number of fundamental data structures (a data
management and storage format). The Pandas DataFrame is a two-dimensional, size-mutable,
potentially heterogeneous tabular data structure with labeled axes (rows and columns);
that is, data is aligned in a tabular fashion in rows and columns.
3. Scikit – learn
Scikit-learn provides a range of supervised and unsupervised learning algorithms
via a consistent interface in Python. It is licensed under a permissive simplified BSD
license and is distributed with many Linux distributions, encouraging academic and
commercial use. The vision for the library is a level of robustness and support required
for use in production systems. This means a deep focus on concerns such as ease of use,
code quality, collaboration, documentation and performance.
4. Matplotlib
Matplotlib is a powerful visualization library in Python for 2D plots of arrays.
It is a multi-platform data visualization library built on NumPy arrays and
designed to work with the broader SciPy stack. Matplotlib comes with a wide variety
of plots. Plots help us understand trends and patterns, and to make correlations;
they are typically instruments for reasoning about quantitative information. Pyplot is a
Matplotlib module which provides a MATLAB-like interface. Matplotlib is designed
to be as usable as MATLAB, with the ability to use Python and the advantage of being
free and open-source. Each pyplot function makes some change to a figure: e.g., it creates
a figure, creates a plotting area in a figure, plots some lines in a plotting area, or
decorates the plot with labels. The various plots we can produce using Pyplot are Line Plot,
Histogram, Scatter, 3D Plot, Image, Contour, and Polar.
5. Seaborn
Seaborn is an open-source Python library built on top of Matplotlib. It is used for
data visualization and exploratory data analysis. Seaborn works easily with DataFrames
and the Pandas library, and the graphs created can be customized easily. While Matplotlib
serves the same data-visualization purpose, Seaborn offers more attractive default themes,
whereas Matplotlib is typically used for basic graphs.
6. Flask
Flask is a web framework: a Python module that lets you develop web
applications easily. It has a small, easy-to-extend core: it is a microframework that
does not include an ORM (Object Relational Mapper) or similar features, but it does have
many useful features like URL routing and a template engine. It is a WSGI web app framework.
7. PyMySQL
PyMySQL is a pure-Python MySQL client library, based on PEP 249. Most public
APIs are compatible with mysqlclient and MySQLdb. PyMySQL works with MySQL
5.5+ and MariaDB 5.5+.
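Because the system design in Chapter 5 validates login credentials against a database, a minimal PyMySQL sketch might look like the following. The connection values and table layout are assumptions for illustration only:

import pymysql

# Hypothetical connection settings; replace with the deployment's actual values.
connection = pymysql.connect(host="localhost", user="root",
                             password="secret", database="phishing_app")
try:
    with connection.cursor() as cursor:
        # A parameterized query avoids SQL injection when checking credentials
        cursor.execute("SELECT id FROM users WHERE email=%s AND password=%s",
                       ("user@example.com", "hashed_password"))
        row = cursor.fetchone()
        print("Login valid" if row else "Invalid credentials")
finally:
    connection.close()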
4.4.5 ML LIBRARIES
1. Whois
pywhois is a Python module for retrieving WHOIS information of domains. It works
with Python 2.4+ and has no external dependencies.
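A sketch of how WHOIS data could support the Domain Registration Length feature from Section 4.3 is shown below. This is hedged: the fields returned by the python-whois package vary by registrar, and the one-year threshold is an assumption:

import whois
from datetime import datetime

def domain_registration_length(domain):
    # Feature 9: short registration periods are typical of phishing domains
    record = whois.whois(domain)
    expiry = record.expiration_date
    if isinstance(expiry, list):      # some registrars return several dates
        expiry = expiry[0]
    if expiry is None:
        return 1                      # missing data treated as suspicious (assumed policy)
    remaining_days = (expiry - datetime.now()).days
    return -1 if remaining_days > 365 else 1

print(domain_registration_length("google.com"))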
2. Xgboost
XGBoost (Extreme Gradient Boosting) belongs to a family of boosting algorithms
and uses the gradient boosting (GBM) framework at its core. It is an optimized
distributed gradient boosting library.
3. Favicon
Favicons are used in browser tabs, browser history, toolbar apps, bookmark
dropdowns, the search bar, and search-bar recommendations. In all of these, especially in
the bookmarks and history tabs, which consist of lists of URLs that all look the same, the
favicon makes it faster to find the website you are looking for.
4. Requests
The Requests library is an integral part of Python for making HTTP requests
to a specified URL. Whether for REST APIs or web scraping, Requests is essential to learn
before proceeding further with these technologies. When one makes a request to a
URI, it returns a response. Requests provides inbuilt functionality for managing
both the request and the response.
5. Beautiful soup
Beautiful Soup is a Python library used for web scraping, to pull data out of HTML
and XML files. It creates a parse tree from page source code that can be used to extract
data in a hierarchical and more readable manner.
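Combining Requests and Beautiful Soup, the favicon rule from Section 4.3 could be sketched as follows. This is an illustrative sketch only; real pages declare favicons in several ways that this simple version ignores:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urlparse, urljoin

def favicon_external(url):
    # Feature 10: a favicon loaded from a different domain suggests phishing
    page = requests.get(url, timeout=10)
    soup = BeautifulSoup(page.text, "html.parser")
    link = soup.find("link", rel=lambda value: value and "icon" in value)
    if link is None or not link.get("href"):
        return -1                     # no declared favicon: treated as legitimate (assumed)
    favicon_host = urlparse(urljoin(url, link["href"])).netloc
    return 1 if favicon_host != urlparse(url).netloc else -1

print(favicon_external("https://www.google.com"))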
6. Google Search
If you want to develop a search service utilizing the power of Google Search, you
can do so using the google module in Python. You can use it to develop a backend
service for a desktop application, or to implement a website search or app search with
the Python code running on your server.
Machine learning algorithms improve their performance over time as they encounter more
data, making them versatile tools for tasks such as classification, regression, clustering,
and pattern recognition.
Python originated from its creator's desire for a language that was extensible; this led
to the design of a new language, which was later named Python.
Python is a great and friendly language to use and learn, and it can be adapted to both
small and large projects. Python can cut a project's development time greatly; overall, it
is much faster to write Python than many other languages. Sometimes only Python code is used
for a program, but often it handles the simpler jobs while another programming language is
used for the more complicated tasks. Its standard library is made up of many functions that
come with Python when it is installed.
4.5.3 PYTHON PROGRAM USING ANACONDA
Use Anaconda Navigator to launch an application; then create and run a simple Python
program with Jupyter Notebook.
Anaconda is a free and open-source distribution of the Python and R programming
languages for scientific computing (data science, machine learning applications, large-scale
data processing, predictive analytics, etc.) that aims to simplify package management and
deployment. Package versions are managed by the package management system conda. The
Anaconda distribution is used by over 12 million users and includes more than 1,400 popular
data-science packages suitable for Windows, Linux, and macOS.
4.5.3.1 OPEN ANACONDA NAVIGATOR
Choose the instructions for your operating system. Click the Start icon and search for the Navigator.
Python supports several programming paradigms, including object-oriented, imperative,
functional, and logic programming. Python uses dynamic typing, and a combination of
reference counting and a cycle-detecting garbage collector for memory management.
It also features dynamic name resolution (late binding), which binds method and variable
names during program execution. Python's design offers some support for functional
programming in the Lisp tradition: it has filter(), map() and reduce() functions, list
comprehensions, dictionaries, sets and generator expressions.
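For instance, the functional constructs mentioned above can be combined in a few lines (a minimal illustrative snippet):

from functools import reduce

lengths = [len(u) for u in ["http://a.com", "http://bb.com"]]   # list comprehension
long_urls = list(filter(lambda n: n > 12, lengths))             # filter()
doubled = list(map(lambda n: n * 2, lengths))                   # map()
total = reduce(lambda a, b: a + b, lengths)                     # reduce() lives in functools in Python 3
print(lengths, long_urls, doubled, total)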
The language's core philosophy is summarized in the document The Zen of Python (PEP
20), which includes aphorisms such as:
• Beautiful is better than ugly
• Explicit is better than implicit
• Simple is better than complex
• Complex is better than complicated
• Readability counts
5. SYSTEM DESIGN
Figure 5.4.1 depicts how the user interacts with the system to achieve the main objective.
The user has to register if new to the system, or log in; then he/she enters the URL to be
checked. The URL provided is subjected to feature extraction, and the extracted features are
given as input to the prediction phase. In the prediction phase, the features are checked
against the saved model (Random Forest model) and the result is printed on the screen. If
the result is -1, the output is “Website is not safe to use”; otherwise, the output is
“Website is safe to use”.
Figure 5.4.3 depicts the flow of actions by the user to obtain the final output. The user
first registers; the next step is login. When the user provides login credentials, they are
checked against the database and validated. If the details are found, the user moves to the
next screen, where the URL has to be entered; otherwise, the user is asked to re-enter the
credentials. The URL entered is passed to the extraction phase, where the required features
are extracted from the URL into a sparse matrix, and this matrix is provided as input to the
prediction phase. The predicted result from the ML model is given as output to the user. If
the result is -1, the output is “Website is not safe to use”; otherwise, the output is
“Website is safe to use”.
6. IMPLEMENTATION AND RESULT ANALYSIS
6.1 INTRODUCTION
In the context of phishing website detection, machine learning is a pivotal tool employed
through a systematic process. It begins with the compilation of a labeled dataset, encompassing
both phishing and legitimate websites. Features, such as URL structure and content analysis,
are then extracted from this dataset. Following data preprocessing, an appropriate machine
learning algorithm is selected, and the model is trained to recognize patterns distinguishing
between malicious and genuine websites. Validation and hyperparameter tuning ensure the
model's efficacy, with evaluation metrics like accuracy and precision guiding the optimization
process. Once validated, the model is deployed for real-time detection, often integrated into
web browsers or email clients. Continuous monitoring and updates are crucial, given the
evolving nature of phishing techniques, and measures are taken to enhance the model's
robustness against adversarial attacks. The integration of the machine learning model into
broader cybersecurity systems provides a multi-layered defense against phishing threats. This
comprehensive approach, combining machine learning with other security measures,
strengthens the overall security posture and reduces the risk of falling victim to phishing
attacks.
6.2 TYPES OF CLASSIFIERS
6.2.1 LOGISTIC REGRESSION
Logistic Regression, despite its name, is a versatile algorithm not limited to binary
classification; it can be extended for multi-class text classification tasks. In this context, it
works by modeling the relationship between the input features (word frequencies in the case of
text) and the probability of a document belonging to each class using the softmax function. The
model estimates a separate probability for each class, and the class with the highest probability
is assigned as the final prediction.
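In standard notation, the softmax function mentioned above converts the raw score z_k that the model assigns to class k into a probability:

P(y = k \mid x) = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}}, \qquad z_k = w_k^{\top} x + b_k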
6.2.2 SUPPORT VECTOR MACHINE
Support Vector Machines (SVMs) are powerful classifiers widely applied to multi-class
text classification tasks. SVMs operate by finding an optimal hyperplane in a high-
dimensional space that best separates the data points corresponding to different classes. In the
context of text classification, each feature represents the frequency of a word in a document,
and the SVM seeks to create a decision boundary that maximizes the margin between
different classes.
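In the standard formulation, the separating hyperplane and the margin being maximized can be written as:

w^{\top} x + b = 0, \qquad \text{margin} = \frac{2}{\lVert w \rVert}, \qquad \min_{w,b} \tfrac{1}{2}\lVert w \rVert^{2} \ \text{ subject to } \ y_i (w^{\top} x_i + b) \ge 1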
Random Forests achieve a reduction in overfitting by combining many weak learners that
underfit because they only utilize a subset of all training samples. Random Forests can
handle a large number of variables in a data set, and during the forest construction process
they make an unbiased estimate of the generalization error. Besides, they can estimate
missing data well. The main drawback of Random Forests is the lack of reproducibility,
because the process of forest construction is random. It is also difficult to interpret the
final model and subsequent results, because it involves many independent decision trees.
6.2.7 GRADIENT BOOSTING CLASSIFIER
A Gradient Boosting Classifier is a powerful machine learning algorithm that is
commonly used for phishing website detection. It belongs to the ensemble learning family and
is often employed to improve the accuracy and robustness of binary classification tasks, such
as distinguishing between legitimate and phishing websites. Gradient Boosting is an ensemble
technique that combines multiple weak learners (typically decision trees) to create a strong
predictive model. The primary idea is to iteratively add decision trees, with each new tree
correcting the errors made by the previous ones. This process is guided by the gradient of a loss
function, which measures the difference between the predicted and actual values.
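In standard gradient boosting notation, this iterative correction takes the stagewise form below, where \nu is the learning rate and h_m is the new tree fitted to the negative gradient of the loss:

F_m(x) = F_{m-1}(x) + \nu\, h_m(x), \qquad h_m \approx -\left[ \frac{\partial L(y, F(x))}{\partial F(x)} \right]_{F = F_{m-1}}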
6.2.8 CATBOOST CLASSIFIER
CatBoost is a powerful and user-friendly algorithm that excels in handling categorical
features, making it a valuable tool for classification tasks, especially when dealing with real-
world datasets that contain a mix of categorical and numerical features.
6.2.9 XGBOOST CLASSIFIER
XGBoost is a refined and customized version of Gradient Boosting designed to provide
better performance and speed. The most important factor behind the success of XGBoost is
its scalability in all scenarios. XGBoost runs more than ten times faster than popular
solutions on a single machine and scales to billions of examples in distributed or
memory-limited settings. The scalability of XGBoost is due to several important algorithmic
optimizations. These innovations include a novel tree learning algorithm for handling sparse
data, and a theoretically justified weighted quantile sketch procedure that enables handling
instance weights in approximate tree learning. Parallel and distributed computing make
learning faster, which enables quicker model exploration. More importantly, XGBoost exploits
out-of-core computation and enables data scientists to process hundreds of millions of
examples on a desktop. Finally, it is even more exciting to combine these techniques to make
an end-to-end system that scales to even larger data with the least amount of cluster
resources.
# Assumes X_train, X_test, y_train, y_test were produced by an earlier
# train/test split of the feature matrix (see Section 1.7).
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn import metrics
from xgboost import XGBClassifier
from catboost import CatBoostClassifier

model1 = LogisticRegression()
model2 = RandomForestClassifier(random_state=42, max_depth=15,
                                n_estimators=200, min_samples_split=2,
                                min_samples_leaf=1)
model3 = XGBClassifier(n_estimators=500)
model4 = KNeighborsClassifier(n_neighbors=7)
model5 = DecisionTreeClassifier()
model6 = CatBoostClassifier(learning_rate=0.1)
model7 = SVC(kernel='linear', gamma='scale')
model8 = MLPClassifier()
# max_depth value assumed; the original listing used an undefined variable here
model9 = GradientBoostingClassifier(max_depth=4, learning_rate=0.7)

models = [model1, model2, model3, model4, model5,
          model6, model7, model8, model9]

# Fit every candidate classifier on the same training split
for model in models:
    model.fit(X_train, y_train)

# Evaluate each fitted model on both splits
for model in models:
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)
    print(type(model).__name__)
    print("Accuracy on training Data: {:.3f}".format(
        metrics.accuracy_score(y_train, y_train_pred)))
    print("Accuracy on test Data: {:.3f}".format(
        metrics.accuracy_score(y_test, y_test_pred)))
    print("F1_score on training Data: {:.3f}".format(
        metrics.f1_score(y_train, y_train_pred)))
    print("F1_score on test Data: {:.3f}".format(
        metrics.f1_score(y_test, y_test_pred)))
    print("Recall on training Data: {:.3f}".format(
        metrics.recall_score(y_train, y_train_pred)))
    print("Recall on test Data: {:.3f}".format(
        metrics.recall_score(y_test, y_test_pred)))
    print("Precision on training Data: {:.3f}".format(
        metrics.precision_score(y_train, y_train_pred)))
    print("Precision on test Data: {:.3f}".format(
        metrics.precision_score(y_test, y_test_pred)))
Once these metrics are collected for every classifier, we can see how the models compare. The criteria for selecting the best model should be defined, taking into account
the specific nature of the problem. This might involve prioritizing a particular metric,
depending on whether accuracy, precision, or recall is of greater importance. Cross-validation
is a critical step to ensure the robustness of performance metrics, as it provides estimates of
model performance on different subsets of the data. Ensemble methods, such as bagging or
boosting, can be explored to combine predictions from multiple models and potentially improve
overall performance. Additionally, domain knowledge and practical considerations should play
a role in the decision-making process. Sometimes, a model that slightly underperforms
according to traditional metrics may be more suitable for deployment based on other factors
like interpretability or resource requirements. Hyperparameter tuning can be performed on the
selected model(s) to further optimize performance, and the final step involves evaluating the
chosen model(s) on a separate test set not used during training or model selection. This step
provides an unbiased estimate of the model's generalization performance and ensures that the
selected model performs well on new, unseen data. The entire process of model selection is
iterative, and it's important to continuously refine and validate choices based on ongoing
analyses and feedback from the real-world application.
Hence, the Gradient Boosting Classifier algorithm is considered the best model among
them, and it is selected as the final model. To save this trained model in Python, we can
use the “pickle” or “joblib” module.
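A minimal sketch of saving and reloading the chosen model with pickle follows; the file name is an assumption, and model9 refers to the GradientBoostingClassifier trained in the listing above:

import pickle

# Persist the trained Gradient Boosting model to disk
with open("phishing_model.pkl", "wb") as f:
    pickle.dump(model9, f)

# Later, e.g. inside the deployed website, load it back for inference
with open("phishing_model.pkl", "rb") as f:
    loaded_model = pickle.load(f)
print(loaded_model.predict(X_test[:1]))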
6.5 WHY GRADIENT BOOSTING ALGORITHM
Main.py
def convertion(url, prediction):
    # Map the model's numeric output to the labels shown on the result page
    if prediction == 1:
        return [url, "Safe", "Continue"]
    else:
        return [url, "Not Safe", "Still want to Continue"]
6.7 PREDICTION
In the deployed website for phishing website detection, the prediction process unfolds in
several sequential steps. Users initiate the process by providing input data, typically in the form
of a URL or relevant features associated with a website. Following this, the input data
undergoes preprocessing to ensure proper formatting and alignment with the model's training
data. The pre-trained machine learning model, previously saved and deployed, is then loaded
into the website's memory, allowing access for making predictions. If necessary, feature
extraction or additional processing may occur to obtain relevant information for the model.
Subsequently, the machine learning model is invoked to perform inference, applying learned
patterns to classify the website as either legitimate or potentially a phishing site. The resulting
prediction is communicated back to the user through the website's interface, enabling users to
take appropriate actions based on the model's assessment. The predicted result by the ML model
is provided as output to the user. If the result is -1 the output is given as “Website is not safe
to use”. Otherwise, the output is displayed as “Website is safe to use”. Optionally, the website
may incorporate a feedback mechanism, providing users with the opportunity to contribute
input on the accuracy of predictions, thereby facilitating continuous improvement and
retraining of the machine learning model. This user-friendly and efficient process allows for
quick assessments of a website's legitimacy based on the insights generated by the deployed
machine learning model.
TESTING
7. TESTING
The main aim of testing is to analyze the performance and to evaluate the errors that
occur when the program is executed with different input sources and runs in different
operating environments.
In this project, we have developed a GUI and machine learning code that help detect
website URLs and predict whether or not they are phished. The main aim of testing this
project is to check whether the URL is predicted accurately, and to check the working
performance when different URLs are given as inputs.
The testing steps are:
a. Unit Testing
b. Integration Testing
c. Validation Testing
d. User Acceptance Testing
e. Output testing
In software engineering, verification and validation (V&V) is the process of checking that
a software system meets its specifications and fulfills its intended purpose. It may also be
referred to as software quality control.
Test Case 1
Input: https://www.google.com
Expected Output: Parameters are extracted and values are assigned
Actual Output: Parameters are extracted and values are assigned
Result: Successful

Test Case 1
Input: https://www.google.com
Result: Successful

Test Case 2
Input: https://Ieeexplore.ieee.org
Result: Successful

Test Case 3
Input: http://123.456.789.123/amazon.com/
Result: Successful

Test Case 4
Input: http://123.456.789.123/paypal.com/
Result: Successful
7.3 SCREENSHOTS
8.1 CONCLUSION
It is found that phishing attacks pose a crucial threat, and it is important for us to
have a mechanism to detect them. Since very important and personal user information can be
leaked through phishing websites, it becomes even more critical to take care of this issue.
This problem can be addressed by using machine learning algorithms with a classifier. We
already have classifiers that give a good phishing prediction rate, but after our survey we
find that it is better to use a hybrid approach for prediction and further improve the
accuracy of phishing-website prediction. We have seen that the existing system gives less
accuracy, so we proposed a new phishing detection method that employs URL-based features,
and we also generated classifiers through several machine learning algorithms. We obtained
the desired results for testing whether a site is phishing or not by using five different
classifiers.
Further work can be done to enhance the model by using ensemble models to obtain a
greater accuracy score. Ensemble methods are an ML technique that combines many base models
to generate an optimal predictive model. Further-reaching future work would be combining
multiple classifiers, trained on different aspects of the same training set, into a single
classifier that may provide a more robust prediction than any of the single classifiers on
its own. The project can also include other variants of phishing, such as smishing and
vishing, to complete the system. Looking even further out, the methodology needs to be
evaluated on how it might handle collection growth. The collections will ideally grow
incrementally over time, so there will need to be a way to apply a classifier incrementally
to the new data, but also potentially to have this classifier receive feedback that might
modify it over time.
9. REFERENCES
[1] J. Rashid, T. Mahmood, M. W. Nisar and T. Nazir, "Phishing Detection Using Machine
Learning Technique," 2020 First International Conference of Smart Systems and Emerging
Technologies (SMARTTECH), 2020, pp. 43-46.
[2] M. H. Alkawaz, S. J. Steven and A. I. Hajamydeen, "Detecting Phishing Website Using
Machine Learning," 2020 16th IEEE International Colloquium on Signal Processing & Its
Applications (CSPA), 2020, pp. 111-114.
[3] V. Patil, P. Thakkar, C. Shah, T. Bhat and S. P. Godse, "Detection and Prevention of Phishing
Websites Using Machine Learning Approach," 2018 Fourth International Conference on
Computing Communication Control and Automation (ICCUBEA), 2018, pp. 1-5.
[4] W. Bai, "Phishing Website Detection Based on Machine Learning Algorithm," 2020
International Conference on Computing and Data Science (CDS), 2020, pp. 293-298.
[5] A. Razaque, M. B. H. Frej, D. Sabyrov, A. Shaikhyn, F. Amsaad and A. Oun, "Detection of
Phishing Websites using Machine Learning," 2020 IEEE Cloud Summit, 2020, pp. 103-107.
[6] M. M. Vilas, K. P. Ghansham, S. P. Jaypralash and P. Shila, "Detection of Phishing Website
Using Machine Learning Approach," 2019 4th International Conference on Electrical,
Electronics, Communication, Computer Technologies and Optimization Techniques
(ICEECCOT), 2019, pp. 384-389.
[7] A. Alswailem, B. Alabdullah, N. Alrumayh and A. Alsedrani, "Detecting Phishing Websites
Using Machine Learning," 2019 2nd International Conference on Computer Applications &
Information Security (ICCAIS), 2019, pp. 1-6.
[8] Yuan, H., Chen, X., Li, Y., Yang, Z., & Liu, "Detecting Phishing Websites and Targets
Based on URLs and Webpage Links," 2018 24th International Conference on Pattern
Recognition, 2018, pp. 3669-3674.
[9] SHENG, Steve; WARDMAN, Brad; WARNER, Gary; CRANOR, Lorrie; HONG, Jason;
[11] GUARNIERI, Claudio. The Year of the Phish [online]. Nex [visited on 2020-04-12].
Available from: https://nex.sx/blog/2019/12/15/the-year-of-the-phish.html.
[12] Phishing Activity Trends Report [online].
[13] Uniform Resource Identifier (URI): Generic Syntax [online]. IETF. Available from:
https://tools.ietf.org/html/rfc3986.
[14] KOZA, John R.; BENNETT, Forrest H.; ANDRE, David; KEANE, Martin A. Automated
Design of Both the Topology and Sizing of Analog Electrical Circuits Using Genetic
Programming. In: Artificial Intelligence in Design ’96. 1996, pp. 151–170. ISBN
978-94-009-0279-4. Available also from: https://doi.org/10.1007/978-94-009-0279-4_9.
[15] R. Kiruthiga and D. Akila, “Phishing Website Detection Using Machine Learning”, 2022.
[16] GÉRON, Aurélien. Hands-On Machine Learning with Scikit-Learn, Keras, and
TensorFlow. O’Reilly Media, Inc., 2017, chap. 1. ISBN 978-1-49-203264-9.
[17] Transport Layer Security (TLS) Extensions [online]. IETF [visited on 2020-04-18].
Available from: https://www.rfc-editor.org/info/rfc3546.
[18] Lizhen Tang and Qusay H. Mahmoud, “A Survey of Machine Learning Based Solutions
for Phishing Website Detection”, 2021.
[19] COX, Nicholas; JONES, Kelvyn. Exploratory data analysis. Quantitative Geography,
London: Routledge. 1981, pp. 135–143.
[20] HUCKA, Michael. Nostril: A nonsense string evaluator written in Python. Journal of Open
Source Software. 2018, vol. 3, no. 25, pp. 596. Available from DOI: 10.21105/joss.00596.
[21] CLAESEN, Marc; DE MOOR, Bart. Hyperparameter Search in Machine Learning. 2015.