ML Malicious URL

The document discusses the application of machine learning algorithms in detecting malicious URLs, highlighting a project aimed at achieving 98% accuracy. The author gathered a dataset of approximately 400,000 URLs, including 80,000 malicious ones, and plans to use Logistic Regression for analysis. The initial step involved creating a custom tokenizer to process the unique structure of URLs.

Uploaded by

davychuinz
Copyright
© All Rights Reserved

With the growth of machine learning over the past few years, many tasks are now being done with
the help of machine learning algorithms. Unfortunately or fortunately, little work has been
done on applying machine learning to cyber security, so I thought of presenting some
at Fsecurify.

A few days ago, I had an idea: what if we could tell a malicious URL from a non-malicious
one using a machine learning algorithm? There has been some research done
on the topic, so I thought I should give it a go and implement something from scratch.
So let’s start.

Machine Learning and Security | Using Machine Learning to detect Malicious URLs with 98% accuracy

Gathering Data

The first task was gathering data. I did some surfing and found some websites listing
malicious links. I set up a little crawler and crawled a lot of malicious links from various
websites. The next task was finding clean URLs. Fortunately, I did not have to crawl any,
since a data set was already available. Don’t worry that I am not mentioning the sources
of the data here; you’ll get the data at the end of this post.

So, I gathered around 400,000 URLs, of which around 80,000 were malicious and the rest
were clean. There we have it, our data set. Let’s move on.
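
To put those numbers in context, here is a quick sketch of the class balance. The counts are the rounded figures quoted above, not exact ones, and the variable names are only for illustration. It shows that a model which labels every URL as clean would already be right about 80% of the time, which is the baseline any accuracy figure on this data should be read against.

```python
# Approximate counts quoted in the post; the exact figures are not given.
total_urls = 400_000
malicious_urls = 80_000

clean_urls = total_urls - malicious_urls
malicious_share = malicious_urls / total_urls

# A trivial classifier that labels every URL "clean" is correct on
# every clean URL and wrong on every malicious one.
majority_baseline_accuracy = clean_urls / total_urls

print(f"clean URLs: {clean_urls}")
print(f"malicious share: {malicious_share:.0%}")
print(f"majority-class baseline accuracy: {majority_baseline_accuracy:.0%}")
```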

Analysis

We’ll be using Logistic Regression since it is fast. The first step was tokenizing the URLs. I
wrote my own tokenizer function for this, since URLs are not like ordinary document text.
Some of the tokens we get are ‘virus’, ‘exe’, ‘php’, ‘wp’, ‘dat’, etc.
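
To give a rough idea of what such a tokenizer could look like, here is a minimal sketch. It is not the tokenizer used for this post: the function name `tokenize_url`, the rule of splitting on every non-alphanumeric character, and the `IGNORED_TOKENS` list are all assumptions made for illustration.

```python
import re
from typing import List

# Assumption: tokens that show up in almost every URL carry little signal,
# so this sketch drops them. The list itself is hypothetical.
IGNORED_TOKENS = {"http", "https", "www", "com"}


def tokenize_url(url: str) -> List[str]:
    """Split a URL into lowercase alphanumeric tokens (illustrative only)."""
    # Split on every run of non-alphanumeric characters ('/', '.', '-', '?', ...).
    parts = re.split(r"[^a-z0-9]+", url.lower())
    # Drop empty strings left by leading/trailing separators and ignored tokens.
    return [p for p in parts if p and p not in IGNORED_TOKENS]


if __name__ == "__main__":
    print(tokenize_url("http://example.com/wp-content/virus.exe"))
```

A callable like this can be passed as the `tokenizer` argument of scikit-learn's `TfidfVectorizer`, whose output can then be fed to `LogisticRegression`.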
