
A Comparative Study on Malware Detection using Machine Learning Techniques


Dr. Gayetri Devi.S.V, Associate Professor, Department of AI&DS, Aalim Muhammed Salegh College of Engineering, gayetri.venkhatraman@gmail.com
R.Gayathri, Assistant Professor, Department of IT, Aalim Muhammed Salegh College of Engineering, r.gayathri@aalimec.ac.in
S.Shahana Fareen, Assistant Professor, Department of AI&DS, Aalim Muhammed Salegh College of Engineering, shahanafareen@aalimec.ac.in
P.Muni Raja Chandra, Assistant Professor, Department of Mechanical, Aalim Muhammed Salegh College of Engineering, p.munirajachandra@aalimec.ac.in

Abstract— Malware is one of the most threatening and dynamic issues that the internet faces. It is a cyber security threat that mainly targets computer systems rich in information technologies. Malware detection therefore has to address major difficulties such as model accuracy, sudden malware attacks, and the analysis of malware. Various Artificial Intelligence (AI) techniques are employed for detecting malware, and among them machine learning algorithms detect malware most keenly. This study uses an advanced machine learning algorithm known as X-Adapt Boost, which combines extremely randomized trees with XGBoost. To evaluate its performance, the model is compared with existing models, namely SVM, KNN, RF, and XGBoost, in terms of accuracy, precision, recall, and F1 score. Combining the proposed classifier with effective pre-processing and segmentation techniques improves the accuracy of the model as well as the other metrics. The proposed model achieved the highest accuracy of about 97.82%, precision of about 96.60%, recall of about 97.00%, and F1 score of about 96.80%. These values are higher than those of the other existing models. The proposed classifier therefore strengthens cyber security by detecting malware precisely.

Keywords— Machine Learning, Artificial Intelligence, Malware Detection, Empirical Soft Mode Filtering, Hybrid Relief-GA, X-Adapt Boost

I. INTRODUCTION
Malicious components and their threats are becoming more common as fast as the internet develops [1]. In modern technology, the most important attack to be concerned about is the cyber attack, which exploits an entire computer system through malicious processes and may steal, change, or even destroy the system. Malware is one example of such an attack. Generally, malware is a program designed to harm a user, a business, or an entire computer system [2]. Put simply, all kinds of computer threats are known as malware. It can infect files, or it can act as a stand-alone program in damaging the computer system. Malware is also emerging as a common and complicated threat to the security of modern websites [3]. These cyber attacks happen mainly because of the heavy demand for and use of computer systems and internet connections. Malware is classified into worms, backdoors, spyware, rootkits, and so on. These cause major threats to smart devices, Android devices, and other computer software, which makes securing the data used for personal and professional purposes a great challenge. Businesses, universities, governments, and individuals worldwide are becoming prone to cyber attacks in which their confidential data is stolen by cybercriminals [4]. Various machine learning algorithms are used to secure data by detecting and destroying malware and to eradicate cyber attacks, and many research articles have been published on identifying malware with machine learning algorithms. Some of the early malware and the traditional methods used to detect them are as follows.

In 1966, the work of the mathematician John von Neumann on self-reproducing automata described a program with the capacity to rebuild itself throughout an entire system, which is known as theoretical malware. It is regarded as the first malware, and its detection was manual [5]. Five years later, in 1971, Bob Thomas created a program called Creeper, which was designed to move between computers. It could enter a system and copy itself from one computer to another, replicating across the ARPANET network. The method used for its detection was a manual process. After this, a 15-year-old named Rich Skrenta developed, as a practical joke, a program known as the Elk Cloner virus. The virus spreads when an infected floppy disk is transferred and copies itself into the computer's memory. It was detected with signature-based detection, and it is known as a virus that spread in the wild. In 1986, the Pakistani medical software distributor Amjad Farooq Alvi, together with his brother Basit Farooq Alvi, created the first virus for personal computers, known as Brain. This virus copied itself along with the personal computer software. The method involved in detecting the Brain virus is signature-based detection. Later, in 1988, Robert Morris, a graduate student who released it from a computer at MIT, created the Morris worm, a malware that copies itself repeatedly across infected computers. It threatened the whole world as one of the most widespread cyber attacks of its time, and its author is regarded as the first person convicted of a cybercrime in the United States. The Morris worm was detected with signature-based and anomaly-based detection. These are some of the earliest viruses and their detection methods. This study aims to detect malware effectively, beginning with noise reduction.



Noise reduction is performed with the Empirical Soft Mode Filtering (ESMF) pre-processing technique, followed by the Relief weightage + PCA dimensionality reduction (Hybrid Relief-GA) feature selection method and the X-Adapt Boost (a combination of extremely randomized trees with XGBoost) classification model. The proposed classification model is also compared with existing classification models, namely SVM, KNN, RF, and XGBoost, to verify the performance of the models, which makes the study a comparative analysis.

II. RELATED WORKS

In 2022, Al-Naji et al. [6] compared a traditional classifier with a Euclidean support vector machine augmented classification. Their approach authenticates the environment by using a blockchain. The suggested solution, the CAB-IoT method, helps in eradicating various types of threats, was validated through gauge tests, and achieved the highest accuracy. Although the model achieved the highest accuracy, it is limited by potential bias, high computational cost, and lack of interpretability. Sarhan et al., in 2022, employed a decentralized blockchain platform, a learning system that enables secure malware detection while protecting privacy in collaborative IoT. It helps in protecting and maintaining the networks with high efficiency and ensures a secure machine learning-based cyber security system. Although this machine learning-based secure network helps in facing security risks, it could not give accurate classes, and it is also limited by the lack of cyber threat intelligence (CTI) sharing between the enterprises [7].

Gómez and Muñoz (2023) used deep learning-based classification to detect malware on Android devices. The model is validated using data sets such as CICMalDroid2020, CICMalDroid2017, and CICAndMal2017 and achieves a very high F1 score together with an effective false positive rate. Although the model detected attacks on Android devices very effectively, the computational complexity of the DL-based model limits it in various cases [8]. Manzil and Manohar Naik employed Android malware category detection based on a novel feature vector-based machine learning model [9]. The approach is validated using both machine learning and deep learning methods. It improves the efficiency of the model used and is evaluated on the CICMalDroid data set with the highest accuracy. Although it detects malware with great accuracy, the security framework has to be improved further for effective processing. The study proposed by Poornima et al. in 2024 [10] uses a deep belief network, MAD-NET, to detect malware attacks accurately and enhance security. The data extracted from the dataset is further used for the feature extraction process. This classifier detects malware very accurately and achieves the highest accuracy, which protects Android devices from malware attacks. Although it is good at predicting attacks, it is limited by poor interpretability and computational complexity.

III. METHODOLOGY
The key steps followed in the proposed study and the overall workflow are shown in the schematic diagram in Figure 1.

Figure 1: Schematic diagram of overall workflow

A. Dataset Collection
The dataset required for malware detection is obtained from a publicly available dataset source. Its key features include permission features, system features, security-related features, communication features, data access features, device control features, and miscellaneous features relating to the system's UI components. It provides a rich source of information for researchers developing and evaluating malware detection techniques, which in turn helps in improving cyber security. A total of 51 distinct malware families are present in the samples. The dataset has 17,394 rows (data points from different locations) and 279 columns. Figure 2 shows a sample of the dataset obtained for malware detection and classification.

Figure 2: Sample of selected dataset

B. Pre-processing
Pre-processing, in the context of malware detection, is the data preparation required for machine learning models: transforming, restructuring, and cleaning the data to improve the accuracy of malware identification. In this study, Empirical Soft Mode Filtering (ESMF) is used as the pre-processing technique for detecting malware.
Empirical soft mode filtering is an advanced signal-processing technique that can be readily adapted to reduce unwanted features and patterns in different domains, including cyber security. By applying ESMF as a pre-processing step for malware detection, the highest classification accuracy is expected, because the filtering removes unwanted patterns from the dataset effectively.

ESMF filters out the unnecessary features present in malware datasets, which improves the performance of machine learning-based detection algorithms. The quality of the extracted features is further enhanced by eliminating data that mislead the model. In malware detection, the dataset is first decomposed into empirical structures. The high-frequency patterns are then removed precisely by using statistical approaches. After that, the filtered data is restructured so that the important information is preserved. Finally, the restructured dataset is used for feature extraction, which supports the machine learning classification models. ESMF enhances the signal-to-noise ratio because it removes unwanted patterns, reduces overfitting, and improves robustness, which increases accuracy. ESMF thus acts as a significant pre-processing technique that provides precise filtering for malware detection. Figure 3 shows the pre-processed dataset.

Figure 3: Pre-processing dataset

C. Feature Selection
After pre-processing and filtering, the dataset is ready for the next crucial step in malware detection, feature selection. This step is crucial because it improves classification accuracy, reduces computational cost, and removes the irrelevant features present in the dataset. In this study, Relief weightage + PCA dimensionality reduction (Hybrid Relief-GA) is used for enhanced feature selection. The approach combines dimensionality reduction with feature selection for better classification performance.

The Relief algorithm assigns weight scores to the features, which helps in distinguishing malware from benign samples and in filtering out irrelevant patterns. Next, the genetic algorithm (GA) optimizes the feature subset through selection, crossover, and mutation, with a machine learning classifier providing the fitness. Once the feature set is determined, principal component analysis (PCA) is applied to reduce the dimensions by transforming correlated features into principal components that preserve the variance. The model therefore identifies malware more accurately by combining Relief for scoring, GA for smart selection, and PCA for data compaction. The hybrid approach improves classification accuracy and reduces computational complexity, which makes the model very effective for malware detection. Figure 4 shows the top 15 features extracted by the Hybrid Relief-GA feature selection technique.
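The Relief scoring stage described in the Feature Selection section can be illustrated with a minimal sketch. The paper does not give its exact update rule, so the code below assumes the basic Relief rule for two classes (reward a feature when it differs on the nearest sample of the other class, penalise it when it differs on the nearest sample of the same class) and assumes features already scaled to [0, 1]. The names `relief_weights` and `top_k_features` and the toy data are hypothetical, and the GA and PCA stages are not shown.

```python
from typing import List, Sequence


def relief_weights(X: Sequence[Sequence[float]], y: Sequence[int]) -> List[float]:
    """Score each feature with the basic Relief rule (binary classes).

    Assumption: every feature is already scaled to [0, 1]. Every sample is
    visited once (no random sampling), so the result is deterministic.
    """
    n, d = len(X), len(X[0])
    weights = [0.0] * d

    def distance(a: Sequence[float], b: Sequence[float]) -> float:
        # Manhattan distance between two samples
        return sum(abs(p - q) for p, q in zip(a, b))

    for i in range(n):
        hit = miss = None  # (distance, index) of nearest same-class / other-class sample
        for j in range(n):
            if i == j:
                continue
            dist = distance(X[i], X[j])
            if y[j] == y[i]:
                if hit is None or dist < hit[0]:
                    hit = (dist, j)
            elif miss is None or dist < miss[0]:
                miss = (dist, j)
        if hit is None or miss is None:
            continue  # no update possible without both a hit and a miss
        for f in range(d):
            diff_miss = abs(X[i][f] - X[miss[1]][f])
            diff_hit = abs(X[i][f] - X[hit[1]][f])
            weights[f] += (diff_miss - diff_hit) / n
    return weights


def top_k_features(weights: Sequence[float], k: int) -> List[int]:
    """Indices of the k highest-weighted features, best first."""
    return sorted(range(len(weights)), key=lambda f: weights[f], reverse=True)[:k]


# Toy data (hypothetical): feature 0 separates the classes, feature 1 is noise.
X_toy = [[0.0, 0.2], [0.1, 0.9], [0.2, 0.5], [0.8, 0.3], [0.9, 0.8], [1.0, 0.4]]
y_toy = [0, 0, 0, 1, 1, 1]  # 0 = benign, 1 = malware
w_toy = relief_weights(X_toy, y_toy)
print(top_k_features(w_toy, 1))
```

On a dataset of the size reported here, the quadratic neighbour search in this sketch would be slow; sampling a subset of instances, as the original Relief formulation does, is the usual remedy.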
Figure 4: Feature selection result

D. Classification
After feature selection, malware is predicted and classified using the X-Adapt Boost classification model.
The proposed X-Adapt Boost model (a combination of extremely randomized trees with XGBoost) is compared with existing classification models, namely SVM, KNN, RF, and XGBoost.

X-Adapt Boost (combination of extremely randomized trees with XGBoost): X-Adapt Boost is an advanced hybrid classification approach that combines extremely randomized trees and extreme gradient boosting to enhance the accuracy and efficiency of malware detection. The model combines the strengths of both learning techniques, which improves the handling of high-dimensional malware datasets and reduces overfitting. The extra trees (extremely randomized trees) introduce randomness by choosing the features and the thresholds used to split nodes at random, which creates diverse decision trees. This enhances the generalization of the model and reduces overfitting. The XGBoost component refines the predictions through gradient boosting, which turns weak classifiers into a strong classifier, and applies L1 and L2 regularization to help prevent overfitting. The adaptive boosting step then focuses on the misclassified malware samples so that subsequent learners correct them. This hybrid combination improves feature selection, noise resistance, training speed of the randomized trees, and computational efficiency, which lets the model detect malware accurately. Overall, the hybrid classification model is well suited to large, high-dimensional datasets. The proposed classification model thus detects malware with the highest accuracy, which promotes cyber security and reduces the threat.

SVM: The support vector machine is a widely used classification model for detecting malware. It handles high-dimensional data and separates benign from malicious samples efficiently. It works by finding the hyperplane that maximizes the margin between the classes, which supports generalization to unseen malware. With the radial basis function kernel, it is helpful in detecting complex malware, including polymorphic malware. Although the support vector machine detects malware effectively, it is limited by high computational cost, sensitivity to imbalanced datasets, and difficulty in parameter tuning.

KNN: K-nearest neighbours (KNN) is a simple but effective classification model for malware detection. It relies on distance-based characteristics to classify benign and malware files: a sample is assigned the class that is most common among its closest neighbours in the feature space. KNN is non-parametric, that is, it does not assume any specific data distribution, which makes it easily adaptable to various types of malware. Although it is one of the popular classification techniques for detecting malware, it is limited by its non-parametric nature, high computational cost, high sensitivity to scaling and noise and to class imbalance, and its reliance on dimensionality reduction and feature selection.

RF: Random forest (RF) is also a widely used classification model that detects malware effectively. It handles high-dimensional data and resists overfitting. A random forest consists of a large number of decision trees, each trained on a random subset of features and samples. The final classification is decided by the majority vote of the trees, which improves classification accuracy and robustness. Random forest is used to detect malware patterns because it captures complex features and is resistant to noise. Although the random forest model detects complex patterns keenly, drawbacks such as low interpretability, high computational cost, and potential bias limit it in several cases.

XG-boost: Extreme gradient boosting is a highly efficient and highly scalable classification model used in machine learning for detecting malware. It handles high-dimensional data and large datasets, which results in high accuracy. It works by adding new decision trees to the existing weak trees so that each new tree corrects the errors of the previous ones. Gradient boosting combined with L1 and L2 regularization reduces the risk of overfitting and misclassification. Its effectiveness in selecting the most important features makes it well suited to detecting complicated malware patterns. Although extreme gradient boosting works efficiently, limitations such as high computational cost and difficult interpretation restrict it in various cases, especially in detecting malware.

The comparison of the proposed X-Adapt Boost with the existing models SVM, KNN, RF, and XGBoost showed that the proposed model worked very effectively and overcame the drawbacks of the existing models. Malware detection was achieved with the highest classification accuracy by the proposed classification model.

IV. RESULTS AND DISCUSSION
In this study, malware is detected and classified in four main steps. First, the data is obtained from a publicly available dataset source. Second, the dataset is denoised and pre-processed using Empirical Soft Mode Filtering (ESMF), which performs decomposition, soft mode filtering, and reconstruction. Third, the pre-processed dataset is passed to feature selection, where Hybrid Relief-GA provides feature scoring along with dimensionality reduction. Fourth, the samples are classified with the proposed classification technique, X-Adapt Boost, which labels each sample as benign or malware.
The adaptive boosting mechanism enables the proposed classifier to identify and classify malware very effectively. The proposed X-Adapt Boost classification algorithm thus serves as a hybrid approach for classifying and identifying malware that attacks cyber security. Moreover, it decreases the discomfort experienced while using software and enhances the security of the system overall. The performance of the proposed classifier is compared with existing algorithms, namely SVM, KNN, RF, and XG-Boost, and verified using the evaluation metrics accuracy, precision, recall, and F1 score, as shown in Figures 5, 6 and 7.

The existing classifiers obtained the following scores: SVM, accuracy 87.21%, precision 85.50%, recall 86.10%, and F1 score 85.80%; KNN, accuracy 91.07%, precision 89.30%, recall 88.90%, and F1 score 89.10%; RF, accuracy 92.27%, precision 91.20%, recall 92.00%, and F1 score 91.60%; and XG-Boost, accuracy 94.87%, precision 93.75%, recall 94.10%, and F1 score 93.92%.

Additionally, the proposed classification algorithm predicts and classifies the threat type that affects the software. The pie chart in Figure 8 shows the predicted software threat types, benign and malware, in a ratio of 1:1 (50/50).
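The F1 scores reported in this paper can be cross-checked against the reported precision and recall, since F1 is their harmonic mean, F1 = 2PR / (P + R). The short sketch below recomputes F1 for each classifier from the percentages stated in the paper; the function name `f1_from` and the dictionary layout are illustrative choices only.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same units as the inputs)."""
    return 2 * precision * recall / (precision + recall)


# (precision %, recall %, reported F1 %) as stated in this paper
reported = {
    "SVM": (85.50, 86.10, 85.80),
    "KNN": (89.30, 88.90, 89.10),
    "RF": (91.20, 92.00, 91.60),
    "XG-Boost": (93.75, 94.10, 93.92),
    "X-Adapt Boost": (96.60, 97.00, 96.80),
}

for name, (p, r, f1) in reported.items():
    recomputed = f1_from(p, r)
    print(f"{name}: reported {f1:.2f}, recomputed {recomputed:.2f}")
```

For all five classifiers the recomputed value agrees with the reported F1 score to within 0.01 percentage points, so the reported precision, recall, and F1 figures are mutually consistent.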

Figure 5: Accuracy comparison

Figure 6: Precision & Recall comparison

Figure 7: F1 score comparison

Figure 8: Prediction result

Furthermore, Figure 9 gives a detailed overview of the classification results for malware detection in the form of a confusion matrix. It indicates that the proposed classifier performs best in distinguishing the two classes, benign and malware.

Figure 9: Confusion matrix
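A confusion matrix such as the one in Figure 9 contains everything needed to derive the four evaluation metrics used in this study. The sketch below shows the standard definitions for a binary benign/malware problem, treating malware as the positive class. The function name `binary_metrics` and the example counts are hypothetical; the counts are not taken from Figure 9.

```python
from typing import Dict


def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Accuracy, precision, recall and F1 for the positive (malware) class.

    tp: malware predicted as malware    fp: benign predicted as malware
    fn: malware predicted as benign     tn: benign predicted as benign
    """
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical counts for illustration only; they are NOT read from Figure 9.
example = binary_metrics(tp=90, fp=10, fn=5, tn=95)
for metric, value in example.items():
    print(f"{metric}: {value:.4f}")
```

With a balanced 1:1 class ratio, as reported for the predictions in this study, accuracy is not distorted by class imbalance, so it can be read alongside precision and recall without further correction.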


In Figures 5, 6 and 7, the performance of the proposed algorithm is compared with the performance of the existing algorithms using the evaluation metrics accuracy, precision, recall, and F1 score. The comparison shows that the proposed X-Adapt Boost classification algorithm performed most effectively, achieving the highest accuracy, precision, recall, and F1 score of 97.82%, 96.60%, 97.00%, and 96.80%, respectively, ahead of SVM, KNN, RF, and XG-Boost.

V. CONCLUSION
This paper has presented a hybrid machine learning approach that provides a secure mechanism to detect malware and promote cyber security. The hybrid approach, known as X-Adapt Boost, was also verified against other existing classifiers.
In the comparison with SVM, KNN, RF, and XG-Boost, the proposed X-Adapt Boost performed very well and achieved higher accuracy, precision, recall, and F1 score than the other existing algorithms. Further, the proposed classification algorithm predicts the software threat type, benign or malware, which helps in identifying the threats affecting computer systems precisely. Malware that affects the entire computer system and leads to harmful cyber trafficking is handled well by the proposed classification algorithm. Overall, the proposed study enhances cyber security by eradicating malicious software threats, which makes systems safer and more secure.

REFERENCES
[1] D. Gavriluţ, M. Cimpoesu, D. Anton, L. Ciortuz, “Malware detection using machine learning,” International Multiconference on Computer Science and Information Technology, pp. 735–741, Oct. 2009, doi:10.1109/IMCSIT.2009.5352759.
[2] U.V. Nikam, V.M. Deshmuh, “Performance evaluation of machine learning classifiers in malware detection,” IEEE International Conference on Distributed Computing and Electrical Circuits and Electronics (ICDCECE), pp. 1-5, April 2022.
[3] K. Zhao, D. Zhang, X. Su and W. Li, “Fest: A feature extraction and selection tool for Android malware detection,” 2015 IEEE Symposium on Computers and Communication (ISCC), Larnaca, Cyprus, 2015, pp. 714-720, doi:10.1109/ISCC.2015.7405598.
[4] D. Chandrakala, A. Sait, J. Kiruthika and R. Nivetha, “Detection and Classification of Malware,” 2021 International Conference on Advancements in Electrical, Electronics, Communication, Computing and Automation (ICAECA), Coimbatore, India, 2021, pp. 1-3, doi:10.1109/ICAECA52838.2021.9675792.
[5] John von Neumann, “Theory of self-reproducing automata,” University of Illinois Press, Urbana and London, 1966.
[6] F.H. Al-Naji, R. Zagrouba, “CAB-IoT: continuous authentication architecture based on Blockchain for internet of things,” Journal of King Saud University - Computer and Information Sciences, pp. 2497-2514, November 2020.
[7] M. Sarhan, W.W. Lo, S. Layeghy, M. Portmann, “HBFL: a hierarchical blockchain-based federated learning framework for collaborative IoT intrusion detection,” Comput. Electr. Eng., vol. 103, 2022.
[8] A. Gómez, A. Muñoz, “Deep learning-based attack detection and classification in android devices,” Electronics, vol. 12, no. 15, 2023.
[9] H.H.R. Manzil, S. Manohar Naik, “Android malware category detection using a novel feature vector-based machine learning model,” Cybersecurity, 6 (1) (2023).
[10] S. Poornima, R. Mahalakshmi, “Automated malware detection using machine learning and deep learning approaches for android applications,” Measurement: Sensors, vol. 32, no. 4, 2023, doi:10.1016/j.measen.2023.100955.
