MalwareDelection ML J1

The paper discusses the increasing threat of malware and proposes a model for its detection and classification using machine learning algorithms. It highlights the limitations of existing methods and emphasizes the need for improved techniques to combat advanced malware threats. The proposed model demonstrates high accuracy, particularly with the Random Forest algorithm, while acknowledging challenges such as dataset imbalance and the evolving nature of cyberattacks.



Detection and Classification of Malware for Cyber Security using Machine Learning Algorithms

Conference Paper · April 2023


DOI: 10.1109/ICONSTEM56934.2023.10142575




Judy S¹, Rashmita Khilar²

Research Scholar¹, Professor²
Department of Information Technology
Saveetha School of Engineering
Saveetha Institute of Medical and Technical Sciences, Saveetha University, Chennai, Tamil Nadu
judys9014.sse@saveetha.com¹, rashmitakhilar.sse@saveetha.com²

Abstract:
Malware is a steadily growing threat to information security, and the Windows operating system in
particular faces a very high level of security risk. Unauthorised exploitation of a system can pose a serious
security risk. For instance, PayPal is frequently imitated because hackers can profit significantly from
obtaining consumers' PayPal login information. The main drawbacks of existing systems are that they take more
time to process and are less efficient. To overcome these drawbacks, current research proposes ways for
businesses to detect threats and to adapt and implement numerous cybersecurity techniques in combination with
machine learning and IoT approaches. Many issues still remain with these techniques; for example,
signature-based detection is unattainable. Earlier work concluded that no machine is able to detect
new-generation malware with complete precision. The aim is to mitigate a threat in addition to identifying and
preventing it. This study gives an insight into the various detection and classification techniques that have
been proposed using machine learning algorithms.

Keywords: Malware, Malware Detection, Cybersecurity, Machine Learning Algorithms

I. Introduction

Malware is software code that is intended to cause harm to a computer or network. Malware can manifest
as viruses, worms, Trojan horses, or spyware. Malware detection seeks to locate and remove any type of malware
code from a network. A Cybersecurity Ventures report estimates that cybercrime will cost the world $10.5
trillion per year by 2025, up from $3 trillion in 2015, which the report describes as the largest transfer of
economic wealth in history. Malware will inevitably enter the network, so defences that offer broad visibility
and breach detection should always be in place. Removing malware requires locating malicious actors swiftly,
which calls for continuous network scanning; once a threat has been recognised, the malware must be removed
from the network. Modern antivirus software alone is insufficient to defend against sophisticated online
threats. Recent malware threats and their effects are summarised below.
News Malware Attacks: Hackers send COVID-19 related emails that disseminate false information about the
pandemic.
Fleeceware: Fleeceware apps continue to charge users significant amounts of money even after the apps have been
removed.
IoT Device Attack: Hackers attempt to use devices such as smart speakers and video doorbells to obtain
important information.
Social Engineering Attack: The hacker poses as a particular individual and asks customer service staff about the
victim's account, tricking them into providing sensitive data.
Cryptojacking: Hackers attempt to install cryptojacking malware on computers and mobile devices to aid in the
mining process. Bitcoin has soared above $40,000, and cryptocurrency prices were expected to continue to surge
through 2022.
As more tools become accessible to developers who wish to create AI scripts and software, hackers will
be able to exploit this technology to launch damaging cyberattacks. While cybersecurity firms use machine
learning and artificial intelligence to battle malware, the same technologies may also be widely used to attack
networks and devices. Cyberattacks are frequently time- and resource-intensive for hackers. It is therefore
reasonable to assume that, with the development of AI and machine learning technology, hackers will create
highly advanced and harmful AI-based malware in 2022 and beyond.
Malware detection is crucial because malware is prevalent and dangerous and can easily penetrate
computers or attack its hosts. If malware is found, users receive a warning message. Users will thus refrain
from downloading unfamiliar files or exploring unsafe websites further, which helps prevent hackers from
taking control of the device and stealing data.

II. Literature Survey


2.1 Attacks and Threats
An unauthorised action performed by malware on the user's side is called a malware attack or
cyberattack. Malware can be classified in many ways. In [9], malware is classified according to its structure
into first-generation and second-generation malware. The internal structure of first-generation malware does
not change, whereas in second-generation malware the actions remain the same but the structure changes.
Second-generation malware is further classified as polymorphic, encrypted, oligomorphic and metamorphic.
A detailed description of the types of threats is provided in [7]. These threats occur in different
computational environments. This section covers a few of them.
• Eavesdropping: This attack is also known as sniffing or snooping. It is considered a passive attack, in
which a silent listener (the adversary) listens to the conversation between two parties.
• Traffic Analysis: In this type of attack, the conversation is not only overheard; messages are intercepted
and examined in order to track the timing and location of the information.
• Replay Attack: An adversary retransmits a message captured from a previous conversation in order to
misdirect recipients and make users perform unnecessary tasks chosen by the adversary.
• Malware Attack: This type of attack occurs when a malicious script or code is made to execute on the
victim's computer. The malicious code performs various unauthorised activities such as information
theft, illegal data encryption, and data modification or deletion. The various types of malware and their
descriptions are summarised in Table 1.
| Malware Type | Description | Variants | Damage |
|---|---|---|---|
| Virus | Self-replicates, modifies system code and inserts its own [10]. | Resident, Polymorphic, Macro, Rootkit, Creeper and Slammer | System failure, data corruption, theft of personal information. |
| Worm | Self-replicates without human intervention and spreads through LAN or Internet [9]. | Morris Worm, Bagle, Mydoom, Ryuk and Conficker | Consumes bandwidth, overloads the network, opens a backdoor and steals data. |
| Ransomware | A ransom payment is demanded in order to regain access [11]. | Scareware, screen lockers and encrypting ransomware | Data and files are held hostage until a ransom is paid. |
| Botnet | Sends massive numbers of requests to a web server to prevent it from serving real users. | Command and control, IRC, Telnet, Domains and P2P | Account takeover, data theft and web content scraping. |

Table 1: Description of types of Malwares

Cyberspace attacks mentioned in [13] include malicious URLs, keystroke logging, fraud, disabling of
the firewall and antivirus, malware, spam, phishing, and probing. The authors state that malware and phishing
are the critical threats.

2.2 Malware Analysis


Although antivirus software, firewalls, and similar strategies aid in the detection of malware on a
network [1], the increasing dissemination of malware makes it necessary to develop novel methods of
distinguishing malware from benign software. This is where machine learning appears as a distinct technique for
malware categorization. Based on the reported results, recurrent neural networks are recognised as the strategy
with the highest accuracy. Kamran Shaukat et al. [13] state that there are three sub-classes of malware
detection: static, dynamic and hybrid. In static detection, malicious patterns are examined without executing
the application. In dynamic detection, detection is performed at runtime. Hybrid detection is a combination of
static and dynamic detection.
A detailed account of the two types of malware detection technique, static and dynamic analysis, is
given in [7]. According to that paper, in static analysis the source code is read without executing the program
files. The authors of [3] further state that static analysis is responsible for finding the behavioural
attributes of the file, as shown diagrammatically in Figure 1.1, which presents the various techniques of
earlier static analysis. Opcode sequences and control flow graphs are obtained through static analysis [17].
In dynamic analysis, on the other hand, the file is monitored during execution, which takes place in a
virtual environment. System calls, memory writes, API calls, registry changes and instruction traces are some
of the information obtained through dynamic analysis [17]. According to [18], aspects of static and dynamic
analysis are combined to form hybrid analysis: both static and dynamic analysis are used in the training phase,
but in the testing phase the tool performs only static analysis.
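
As an illustration of a static feature, the opcode sequences mentioned above can be turned into n-gram counts without ever running the file. The following is a minimal sketch, not a method taken from any of the surveyed papers; the function name `opcode_ngrams` and the sample opcode list are hypothetical, and a real pipeline would obtain the opcodes from a disassembler.

```python
from collections import Counter


def opcode_ngrams(opcodes, n=2):
    """Count overlapping n-grams in a sequence of opcode mnemonics."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # zip over n shifted views of the list yields every window of length n.
    grams = zip(*(opcodes[i:] for i in range(n)))
    return Counter(grams)


# Hypothetical opcode sequence, standing in for disassembler output.
sample = ["push", "mov", "call", "push", "mov", "ret"]
bigrams = opcode_ngrams(sample, n=2)
trigrams = opcode_ngrams(sample, n=3)
```

The resulting counts can serve as a fixed-length feature vector once a vocabulary of n-grams has been chosen, which is one common way to feed static information to a classifier.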

2.3 Dataset Survey


The dataset is one of the most important elements of any research. Datasets have to be genuine and
relevant to the research being done. Although the methodology or algorithm used in different studies may be the
same, the dataset need not be. Nayeem Khan et al. [15] stress the importance of dataset quality, since
higher-quality data leads to higher accuracy. The dataset they required consisted of JavaScript instances, both
legitimate and malicious. Sitalakshmi Venkatraman et al. [16] state that the pre-processing stage is responsible
for employing an execution environment that retrieves raw messages. Table 2 lists, for various research papers,
a description of the dataset, the total number of samples along with the training and testing split, and the
source of the dataset.

| Research Paper | Description | Dataset Size | Dataset Source |
|---|---|---|---|
| Defending Malicious Script Attacks Using Machine Learning Classifiers | JavaScript of benign and malware | Total - 1924; Benign - 1515; Malicious - 409 | leakiEst, School of Computer Science, University of Birmingham, Birmingham, UK. |
| Image-Based Malware Classification Using VGG19 Network and Spatial Convolutional Attention | Malware images created from binaries of the malware | Grayscale images - 9389; Malware family - 25 | Malware Images. <http://vision.ece.ucsb.edu/~lakshman/malware_images/album/> |
| Strong Baseline Defences Against Clean-Label Poisoning Attacks | Labeled subsets of 80 million tiny images | Colour images (32x32) - 60000; Train - 50000; Testing - 10000 | CIFAR-10 dataset |
| A survey of malware detection techniques | Dataset deleted later due to privacy reasons | Nil | Columbia University Computer Science (CUCS) dataset |
| Deep Learning and Regularization Algorithms for Malicious Code Classification | Operation codes obtained from ASM files | Samples - 2174 | Kaggle's Microsoft Malware Classification Challenge |
| A Hybrid Deep Learning Image-Based Analysis for Effective Malware Detection | Binary files converted into image representation | Samples - 75000 | Microsoft Malware Classification Challenge (BIG, 2015), Malimg, VX Heavens |
| Zero-day Malware Detection based on Supervised Learning Algorithms of API call Signatures | Dataset uniquely named with MD5 hash value | Exe files - 66,703; Malware - 51,223 | honeynet project, VX heavens |
| Malware Detection Using Honeypot and Machine Learning | Traffic packets captured by Honeypot used for analysis | Sample - 900,000; Malicious - 300,000; Benign - 300,000; Unlabelled - 300,000 | Endgame Malware BEnchmark for Research (EMBER) |

Table 2: Data Set Survey
III. Proposed Model

Figure 1 depicts the malware detection and classification model. The dataset consists of 96477 malware
and 40654 cleanware files. The proposed model consists of the following phases: Pre-processor, Feature
Extractor, Classifier and Output. The files are first passed through the pre-processor, which handles null and
missing values. The next phase is feature extraction: the dataset initially has 56 features, of which the
feature extraction module selects 13. Five different algorithms are used in the Classifier phase. The model
classifies files as legitimate or malware and also identifies which of the five algorithms gives the best
accuracy. For classification, the classifier uses the hash code of the files. The Random Forest algorithm is
found to have the highest accuracy of the five algorithms.

Figure 1: Proposed Classification Model
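
The pre-processing and feature selection phases can be sketched as follows. The paper does not state how missing values are handled or which criterion selects the 13 features, so this sketch makes two assumptions: incomplete records are simply dropped, and features are ranked by their absolute Pearson correlation with the label. All function names and the toy records are hypothetical.

```python
import math


def drop_incomplete(records):
    """Keep only records that contain no None or NaN entries."""
    def present(v):
        return v is not None and not (isinstance(v, float) and math.isnan(v))
    return [r for r in records if all(present(v) for v in r)]


def abs_correlation(xs, ys):
    """Absolute Pearson correlation; 0.0 when either column is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return abs(sxy) / math.sqrt(sxx * syy)


def select_top_k(features, labels, k):
    """Return the indices of the k features most correlated with the label."""
    n_features = len(features[0])
    scores = [
        abs_correlation([row[j] for row in features], labels)
        for j in range(n_features)
    ]
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


# Toy records: three feature values followed by the label (1 = malware).
records = [
    [1.0, 5.0, 0.2, 1],
    [0.9, 5.0, 0.8, 1],
    [None, 5.0, 0.1, 0],   # dropped: missing value
    [0.1, 5.0, 0.7, 0],
    [0.2, 5.0, 0.3, 0],
]
clean = drop_incomplete(records)
features = [r[:-1] for r in clean]
labels = [r[-1] for r in clean]
selected = select_top_k(features, labels, k=1)  # k=13 for the 56-feature dataset
```

Dropping records is the simplest policy; imputing a median or a sentinel value is an equally plausible reading of "pre-processed" and would keep more of the data.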

IV. Experimental Results and Discussions

The proposed model was executed and the results were obtained. The results show two things. The first
is the classification of malicious and legitimate files using the five algorithms. The second is the accuracy,
precision and other measures of the five algorithms, on the basis of which the Random Forest algorithm is
proposed as the best of the five because of its accuracy. Initially, the feature extraction module outputs the
13 extracted features. Figure 2 shows a histogram of the 13 selected features. The performance measures of all
the algorithms are also obtained.

Figure 2: Extracted Features


Table 3 shows the performance measures (precision, recall, F1 score and accuracy) of each algorithm.
The output of the proposed model suggests that the Random Forest algorithm gives the highest accuracy of the
algorithms used in this model.

| Measure | Decision Tree | Random Forest | AdaBoost | Linear Regression | XGBoost |
|---|---|---|---|---|---|
| Precision | 98.82% | 98.78% | 98.20% | 98.20% | 98.20% |
| Recall | 98.87% | 99.33% | 98.03% | 98.03% | 98.03% |
| F1 Score | 98.85% | 99.06% | 98.12% | 98.12% | 98.91% |
| Accuracy | 99.18% | 99.45% | 98.73% | 53.65% | 99.35% |

Table 3: Performance measures of algorithms
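
For reference, the four measures in Table 3 follow from the confusion-matrix counts in the standard way. The sketch below uses hypothetical counts, not the paper's data, and treats malware as the positive class, which is an assumption. Because F1 is the harmonic mean of precision and recall, it always lies between the two, which gives a quick consistency check when reading such a table.

```python
def _ratio(numerator, denominator):
    """Safe division that returns 0.0 for an empty denominator."""
    return numerator / denominator if denominator else 0.0


def classification_metrics(tp, fp, fn, tn):
    """Compute precision, recall, F1 and accuracy from confusion counts."""
    precision = _ratio(tp, tp + fp)
    recall = _ratio(tp, tp + fn)
    f1 = _ratio(2 * precision * recall, precision + recall)
    accuracy = _ratio(tp + tn, tp + fp + fn + tn)
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "accuracy": accuracy,
    }


# Hypothetical counts for illustration only.
metrics = classification_metrics(tp=90, fp=10, fn=5, tn=95)
```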

Figure 3 shows the final classification report. The classification depends on the values extracted
from the dataset. The final report gives the predicted number of malware files and legitimate files across the
dataset. The running time of the model, reported as its time complexity, is measured as 90.72 ms.

Figure 3: Classification Report
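
A report of this kind, together with a wall-clock measurement, could be produced along the following lines. This is only a sketch: `toy_predict` is a hypothetical stand-in for a trained classifier, and the paper does not say how its 90.72 ms figure was measured.

```python
import time
from collections import Counter


def timed_report(predict, samples):
    """Run predict on every sample; return label counts and elapsed milliseconds."""
    start = time.perf_counter()
    counts = Counter(predict(s) for s in samples)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return counts, elapsed_ms


# Hypothetical stand-in classifier: flags samples whose score exceeds 0.5.
def toy_predict(score):
    return "malware" if score > 0.5 else "legitimate"


counts, elapsed_ms = timed_report(toy_predict, [0.9, 0.2, 0.7, 0.4, 0.8])
```

A single measurement like this depends on the machine and its load, so it is better understood as a running time than as a time complexity in the algorithmic sense.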

Limitations
The limitations of existing work depend on the type of cyberattack, the dataset and the classifier algorithm.
The work in [3] has two limitations: the classifier's output may be biased, and with an unbalanced dataset
accuracy might not be a good indicator of performance. In [5], the invulnerability of cyberattacks and their
cascading effects on other system nodes are cited as the limits of that study when compared to previous
classification algorithms for predictive analytics. In [6], research on the resistance of various models to
poisoning attacks is presented along with an analysis of the findings; that work, however, covers only
poisoning attacks.
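
The point about unbalanced data can be made concrete with the class counts quoted in Section III (96477 malware and 40654 cleanware files). In the sketch below, a trivial classifier that labels every file as malware already reaches about 70% accuracy, while its balanced accuracy (the mean of the per-class recalls) is only 0.5. The function names are hypothetical and the calculation is illustrative rather than part of the proposed model.

```python
def majority_baseline_accuracy(class_counts):
    """Accuracy of always predicting the most frequent class."""
    total = sum(class_counts.values())
    return max(class_counts.values()) / total


def balanced_accuracy(tp, fp, fn, tn):
    """Mean of the recall on the positive class and on the negative class."""
    recall_pos = tp / (tp + fn) if (tp + fn) else 0.0
    recall_neg = tn / (tn + fp) if (tn + fp) else 0.0
    return (recall_pos + recall_neg) / 2


# Class counts as stated for the dataset in Section III.
counts = {"malware": 96477, "cleanware": 40654}
baseline = majority_baseline_accuracy(counts)

# "Always malware": every malware file is a true positive,
# every cleanware file is a false positive.
always_malware = balanced_accuracy(tp=96477, fp=40654, fn=0, tn=0)
```

Reporting precision, recall and a balanced measure alongside accuracy, as Table 3 partly does, guards against this effect.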
The proposed model shows a higher accuracy than the works in the references listed below (for instance [3] and
[4]). Still, it has a few limitations. The model works only for datasets that include the MD5 or another hash
value. The running time for the whole dataset is measured as 90.72 ms, which could be reduced further by using
optimization algorithms.
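
The dependence on a hash value can be illustrated with a small lookup sketch. It assumes, purely for illustration, that a file is identified by the MD5 digest of its bytes and matched against a table of known digests; the table contents and function names are hypothetical and do not describe the proposed model's internals.

```python
import hashlib


def md5_of_bytes(data):
    """Return the hexadecimal MD5 digest of a bytes object."""
    return hashlib.md5(data).hexdigest()


def lookup(data, known_hashes):
    """Return the stored label for this content, or 'unknown' if unseen."""
    return known_hashes.get(md5_of_bytes(data), "unknown")


# Hypothetical in-memory table; a real dataset would supply these pairs.
known = {md5_of_bytes(b"example payload"): "malware"}

seen = lookup(b"example payload", known)
modified = lookup(b"example payload!", known)  # one extra byte changes the digest
```

Since changing a single byte produces a different digest, a purely hash-keyed approach cannot recognise a modified variant, which is consistent with the limitation stated above.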

Conclusion and Future Enhancements

Malware classification remains an extremely difficult field, despite the significant research and
notable advances of recent years. It is essential to keep improving current techniques, because attackers also
devise countermeasures to undermine defensive tactics. Various malware prevention tools are described in [7],
which also states that no tool provides full protection for a computer. The approach in [2] is suitable for
detecting anomalies in settings where new anomalies are hidden in the network and are harmful to operation.
In [3], all binary and multi-class classifiers exhibit high accuracy in the experiments: for binary
classification, Decision Trees obtain an accuracy of 98.2 percent, and for multi-class classification, Random
Forests achieve an accuracy of 95.8 percent. Because of the skewness of the data and the rarity of benign files
in the original dataset, the precision and recall are nevertheless slightly lower. That study can be generalised
in subsequent research by using a larger and better-balanced dataset.
In [4], among the four classifiers used, it was concluded that the Random Forest classifier outperformed all
other models with an accuracy of 0.9927 (99.27 percent). According to [5], future studies will concentrate on
applying machine learning (ML) approaches with different classification algorithms to learn the dataset for
anomaly detection and to forecast patterns in cyberattacks. According to [6], by using this method it will be
possible to forecast future trends and define the optimum performance metrics for cyber threat intelligence.
The findings and recommendations of that study make it possible to identify the cybersecurity areas that employ
datasets and models from unreliable sources.

Compared with the other detection methods, the proposed model achieves an accuracy of 99.45% with Random Forest.
This result could be improved further by using optimization algorithms, and the existing work can be modified
to obtain better results.

References

1. B.A.S. Dilhara, “Classification of Malware using Machine learning and Deep learning Techniques”,
International Journal of Computer Applications, October 2021.
2. Ployphan Sornsuwit & Saichon Jaiyen, “A New Hybrid Machine Learning for Cybersecurity Threat Detection
Based on Adaptive Boosting”, Applied Artificial Intelligence, 2019, VOL. 33, NO. 5, 462–482.
3. Ihab Shhadat, Bara Bataineh, Amena Hayajneh, Ziad A. Al-Sharif “The use of Machine Learning Techniques
to Advance the Detection and Classification of unknown Malware”, International Workshop on Data-Driven
Security, April 6 – 9, 2020.
4. Harsha A K, Thyagaraja Murthy A, “Machine Learning Techniques for Malware Detection”, International
Journal of Scientific Research in Science, Engineering and Technology Print ISSN: 2395-1990, Sept – Oct,
2021
5. Abel Yeboah-Ofori, “Classification of Malware Attacks Using Machine Learning In Decision Tree”,
International Journal of Security (IJS), Volume (11) : Issue (2) : 2020
6. A P Chukhnov, Y S Ivanov, “Algorithms for Detecting and Preventing Attacks on Machine Learning Models
in Cyber-Security Problems”, Journal of Physics: Conference Series: Conf. Ser. 2096 012099, 2021
7. Mohammad Wazid, Ashok Kumar Das, Vinay Chamola, Youngho Park, “Uniting cyber security and
machine learning: Advantages, challenges and future research”, ICT Express (2022)
8. Rabia Tahir, “A study on Malware and Malware Detection Techniques”, I.J. Education and Management
Engineering (IJEME), Vol.8, No.2, pp. 20-30, 2018. DOI: 10.5815 / ijeme.2018.02.03
9. Idika, A survey of malware detection techniques – Li, Jun, Detecting smart, self-propagating Internet Worms
10. Pushpendra Dwivedi, Hariom Sharan, “Analysis and Detection of Evolutionary Malware: A Review”,
International Journal of Computer Applications (0975 – 8887), Volume 174 – No. 20, February 2021.
11. Nwokedi Idika, Aditya P. Mathur, “A Survey of Malware Detection Techniques”.
12. Mingfu Xue, Chengxiang Yuan, Heyi Wu, Yushu Zhang, And Weiqiang Liu, “Machine Learning Security:
Threats, Countermeasures, and Evaluations”, IEEE Access, Volume 8, April 2020.
13. Kamran Shaukat, Suhuai Luo, Vijay Varadharajan, Ibrahim A. Hameed, Shan Chen, Dongxi Liu, Jiaming Li,
“Performance Comparison and Current Challenges of Using Machine Learning Techniques in
Cybersecurity”, Energies 2020, 13, 2509; doi:10.3390/en13102509.
14. Arkajit Datta, Kakelli Anil Kumar, Aju. D, “An Emerging Malware Analysis Techniques and Tools: A
Comparative Analysis”, International Journal of Engineering Research & Technology (IJERT), ISSN: 2278-0181,
Vol. 10 Issue 04, April-2021.
15. Nayeem Khan, Johari Abdullah, and Adnan Shahid Khan, “Defending Malicious Script Attacks Using
Machine Learning Classifiers”, Wireless Communications and Mobile Computing Volume 2017, Article ID
5360472.
16. Sitalakshmi Venkatraman, Mamoun Alazab and Vinayakumar R, “A Hybrid Deep Learning Image-Based
Analysis for Effective Malware Detection”, Journal of Information Security and Applications, August 2019.
17. Anusha Damodaran, Fabio Di Troia, Visaggio Aaron Corrado, Thomas H. Austin, Mark Stamp, “A
Comparison of Static, Dynamic, and Hybrid Analysis for Malware Detection”.
18. M. Eskandari, Z. Khorshidpour, and S. Hashemi, “HDM-Analyser: A hybrid analysis approach based on data
mining techniques for malware detection”, Journal of Computer Virology and Hacking Techniques, 9(2):77–93,
2013
