With the growth of machine learning in the past few years, many tasks are now being done with
the help of machine learning algorithms. Unfortunately or fortunately, there has been little
work done on machine learning and cyber security, so I thought of presenting some
at Fsecurify.
A few days ago, I had an idea: what if we could tell a malicious URL from a non-malicious
URL using a machine learning algorithm? There has been some research done
on the topic, so I thought I should give it a go and implement something from scratch.
So let's start.
Machine Learning and Security | Using Machine Learning
to detect Malicious URLs with 98% accuracy
Gathering Data
The first task was gathering data. I did some surfing and found some websites listing
malicious links. I set up a little crawler and crawled a lot of malicious links from various
websites. The next task was finding clean URLs. Fortunately, I did not have to crawl any,
since there was a data set available. Don’t worry that I am not mentioning the sources of the data here.
You’ll get the data at the end of this post.
So, I gathered around 400,000 URLs, of which around 80,000 were malicious and the rest
were clean. There we have it, our data set. Let's move on.
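As a quick sanity check on those numbers, here is a small sketch of the class balance. The counts are the rounded figures quoted above, not exact values, and the variable names are only for illustration. It also shows why the split matters when reading an accuracy figure: a model that labels every URL as clean would already be right about 80% of the time.

```python
# Approximate counts quoted in the post (rounded, not exact).
total_urls = 400_000
malicious_urls = 80_000
clean_urls = total_urls - malicious_urls

# Share of the data set that is malicious.
malicious_share = malicious_urls / total_urls

# Accuracy of a trivial model that always predicts the majority class.
majority_baseline = max(clean_urls, malicious_urls) / total_urls

print(f"clean URLs: {clean_urls}")
print(f"malicious share: {malicious_share:.0%}")
print(f"majority-class baseline accuracy: {majority_baseline:.0%}")
```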
Analysis
We’ll be using Logistic Regression since it is fast. The first part was tokenizing the URLs. I
wrote my own tokenizer function for this, since URLs are not like ordinary document text.
Some of the tokens we get are ‘virus’, ‘exe’, ‘php’, ‘wp’, ‘dat’, etc.
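To give a feel for what such a tokenizer might look like, here is a minimal sketch. It is not the exact function used for this post: the name tokenize_url, the choice to split on every non-alphanumeric character, and the COMMON_TOKENS list are all assumptions made for illustration.

```python
import re

# Hypothetical list of tokens that show up in almost every URL and so
# carry little signal; an assumption for this sketch, not taken from the post.
COMMON_TOKENS = {"http", "https", "www", "com"}


def tokenize_url(url):
    """Split a URL into lowercase tokens on any non-alphanumeric character."""
    raw_tokens = re.split(r"[^a-z0-9]+", url.lower())
    # Drop empty strings left by the split and the very common tokens.
    return [t for t in raw_tokens if t and t not in COMMON_TOKENS]


print(tokenize_url("http://example.com/wp-content/virus.exe"))
# → ['example', 'wp', 'content', 'virus', 'exe']
```

A function like this can be handed to a text vectorizer as a custom tokenizer (for example, scikit-learn's TfidfVectorizer accepts a callable through its tokenizer parameter), and the resulting features can then be fed to the classifier.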