Effective Malware Detection Based On Behaviour and Data Features
1 Introduction
Malware, or malicious software, is a generic term that encompasses viruses, trojans, spyware and other intrusive code. Malware spreads all over the world through the Internet and is increasing day by day, thus becoming a serious threat. According to a recent report from McAfee [1], one of the world's leading independent cybersecurity companies, more than 650 million malware samples were detected in Q1 2017, of which more than 30 million were new. The detection of malware is therefore of major concern to both the anti-malware industry and researchers.
To protect legitimate users from these threats, anti-malware products from companies such as Comodo, McAfee, Kaspersky, Kingsoft, and Symantec provide the major defence against malware, most of them employing the signature-based method. However, this method can easily be evaded by malware writers through evasion techniques such as packing, variable renaming, and polymorphism [2]. To overcome this limitation, heuristic-based approaches have been proposed, aiming to identify malicious behaviour patterns through either static or dynamic analysis. But the ever-increasing number of malware samples makes this method no longer effective.
Recently, various machine learning approaches, such as Support Vector Machine, Decision Tree and Naive Bayes, have been proposed for detecting malware [3]. These techniques rely on data sets that include characteristic features of both malware and benign software to build classification models for detecting (unknown) malware. Although these approaches can achieve high accuracy (on stationary data sets), they are still not sufficient for malware detection. On the one hand, most of them focus on behaviour features such as binary codes [4–6], opcodes [6–8] and API calls [9–11], leaving the data information out of consideration; the few that do consider data information use only simple features such as strings [12, 13] and file relations [14, 15]. On the other hand, as malware continues to evolve, new and unseen malware exhibits different behaviours and features, and obfuscation techniques can make malware even more difficult to detect. Hence, more data sets and experiments are still needed to keep detection effective.
In this paper, we propose an effective approach to detecting malware based
on machine learning. Different from most existing work, we take into account
not only the behaviour information but also the data information. Generally,
the behaviour information reflects how the software intends to behave, while the data information indicates what data the software intends to operate on and how those data are organised. Our approach first learns a classifier from existing executables with known categories, and then uses this classifier to detect new, unseen executables. In detail, we take as representative features the opcodes, data types and system libraries used in executables, all collected through static analysis. As far as we know, our approach is the first to consider data types as features for malware detection. Moreover, in our implementation, we employ various machine learning methods, such as K-Nearest Neighbour, Naive Bayes, Decision Tree, Random Forest, and Support Vector Machine, to train our classifier.
Several experiments are conducted to evaluate our approach. Firstly, we conducted 10-fold cross validation experiments to see how well various machine learning methods perform. The experimental results show that the classifier trained by Random Forest performs best here, with an accuracy of 0.9788 and an AUC of 0.9959. Secondly, we conducted experiments to illustrate that all the features are effective for malware detection; the results also show that in some cases using type information is better than using the other two. Thirdly, to test our approach's ability to detect genuinely new malware or new malware versions, we ran a time-split experiment: we used our classifier to detect malware samples newer than those in our data set. Our classifier detected 81% of the fresh samples, which indicates that it is capable of detecting some fresh malware. The results also suggest that malware classifiers should be updated often with new data or new features in order to maintain classification accuracy. Finally, one reason malware detection is difficult is that malware writers can use obfuscation techniques to evade detection. We therefore performed experiments to test our approach's ability to detect new malware samples obtained by obfuscating existing ones with some obfuscation tools. All the obfuscated malware samples were detected by our classifier, demonstrating that it has a resistance to some obfuscation techniques.
The remainder of this paper is organised as follows. Section 2 presents the related work. Section 3 describes our approach, followed by the implementation and experimental results in Section 4. Finally, Section 5 concludes.
2 Related Work
There has been a lot of work on malware detection. Here we focus on some recent work based on machine learning; interested readers can refer to the surveys [3, 16] for more details.
Over the past decade, intelligent malware detection systems have been de-
veloped by applying data mining and machine learning techniques. Some ap-
proaches employ the N-grams method to train a classifier based on the binary code content [4–6]. Rather surprisingly, a 1-gram model can distinguish malware from benign software quite well on some data sets, but these approaches can be evaded easily by control-flow obfuscation. Some approaches [6–8] take the
opcodes, which reflect the behaviours of the software of interest quite well, as
features to classify malware and benign software. Some other approaches con-
sider API calls, intents, permissions and commands as features [9–11], based on
different classifiers. All these works focus more on the behaviour information,
while our current work considers not only behaviour information but also data
information.
There has been some research that takes data information into account. Ye et al. [12] propose a malware detection approach based on interpretable strings, using an SVM ensemble with bagging. Islam et al. [13] consider malware classification based on integrated static and dynamic features, which include printable strings. Both Karampatziakis et al. [14] and Tamersoy et al. [15] propose malware detection approaches based on file-to-file relation graphs: a file's legitimacy can be inferred by analyzing its relations (co-occurrences) with other labeled (either benign or malicious) peers. Clearly, both the string information and the file relations are quite simple, and not all malware share the same information.
Some recent work employs deep learning to help detect malware. Saxe and Berlin [17] propose a deep neural network malware classifier that achieves a usable detection rate at an extremely low false positive rate and scales to real-world training example volumes on commodity hardware. Hardy et al. [18] develop a deep learning framework for intelligent malware detection based on API calls. Ye et al. [19] propose a heterogeneous deep learning framework, composed of an AutoEncoder stacked with multilayer restricted Boltzmann machines and a layer of associative memory, to detect new unknown malware. In addition, the notion of concept drift has also been brought into malware detection: Jordaney et al. [20] propose Transcend, a fully tuneable classification system that can be tailored to be resilient against concept drift to varying degrees, depending on user specifications. In contrast to these, our approach employs several supervised machine-learning methods.
3 Approach
In this section, we first give a definition of our malware detection problem, then
present our approach.
The malware detection problem can be stated as follows: given a dataset
D = {(e1, c1), . . . , (em, cm)}, where ei ∈ E is an executable file, ci ∈ C = {benign, malicious} is the corresponding actual category of ei (which may be unknown), and m is the number of executable files, the goal is to find a function f : E → C such that ∀i ∈ {1, . . . , m}. f(ei) ≈ ci.
Indeed, the goal is to find a classifier which classifies each executable file as precisely as possible. Our approach first learns a classifier from existing executables with known categories, and then uses this classifier to predict categories for new, unseen executables. Figure 1 shows the framework of our approach, which consists of two components, namely the feature extractor and the malware classifier. The feature extractor extracts the feature information (i.e., opcodes, data types and system libraries) from the executables and represents it as vectors. The malware classifier, in turn, is first trained from an available dataset of executables with known categories by a supervised machine-learning method, and can then be used to detect new, unseen executables. In the following, we describe both components in more detail.
Information Extraction The extractor parses the asm files decompiled from the executables (by IDA Pro [28]) to extract the feature information, namely opcodes, data types and system libraries. Generally, the opcodes used in an executable represent its intended behaviours, while the data types indicate the structures of the data it may operate on. In addition, the imported system libraries, which reflect the interaction between the executable and the system, are also considered. All this information describes the possible mission of an executable in some sense, and similar executables share similar information.
The opcode information can be extracted either statically or dynamically. Here we simply collect and count the opcodes from an asm file instruction by instruction, yielding an opcode list lopcode, which records the opcodes appearing in the asm file together with their occurrence counts. For data type information, we first discover the possible variables, including local and global ones, and recover their types. Then we perform a statistical analysis on the recovered types, wherein, for simplicity, we consider only the data sizes of composite types, yielding a type list ltype containing the recovered types and their occurrence counts. The extraction of library information is similar to that of opcodes, yielding the library list llibrary.
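As an illustration of this step, the following Python code collects opcode and library counts from one decompiled asm file. It is a minimal sketch under assumptions: the opcode set shown is a small sample of the full instruction set, the import-line format is an IDA-style comment we assume the asm files contain, and type recovery (which is considerably more involved) is omitted.

from collections import Counter

# Illustrative subset of x86 opcodes; the real extractor covers the full set.
OPCODES = {"mov", "push", "pop", "call", "jmp", "add", "sub", "xor", "cmp", "lea"}

def extract_counts(asm_path):
    """Collect l_opcode and l_library from one decompiled asm file (sketch)."""
    l_opcode, l_library = Counter(), Counter()
    with open(asm_path, errors="ignore") as asm:
        for line in asm:
            tokens = line.split()
            for tok in tokens:
                if tok in OPCODES:          # count each opcode occurrence
                    l_opcode[tok] += 1
            if "Imports from" in line:      # assumed IDA-style import comment
                l_library[tokens[-1].lower()] += 1
    return l_opcode, l_library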
Malware Classification Learning the classifier from executables with known categories amounts to finding a function f whose predictions are as close to the actual categories as possible, i.e., f* = arg min_f Σ_{i=1,...,m} d(f(ei), ci), where d denotes the distance function. Clearly, there are many classifier algorithms to solve this problem, such as K-Nearest Neighbour [22], Naive Bayes [23], Decision Tree [24], Random Forest [25], Support Vector Machine [26], and the SGD Classifier [27]. In our implementation we have made all these methods available,
so users can select any one they like. Moreover, we have also conducted some
experiments with these algorithms and found out that the classifier trained by
Random Forest performs best here (more details will be given in Section 4).
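As a minimal, self-contained sketch of this step, training the Random Forest variant that performed best in our experiments looks as follows in scikit-learn [29]; the random toy data here merely stand in for the real count vectors.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Toy stand-in for the real data: one row per executable, columns are
# opcode/type/library counts; labels: 0 = benign, 1 = malicious.
rng = np.random.default_rng(0)
X = rng.integers(0, 50, size=(200, 30))
y = rng.integers(0, 2, size=200)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

clf = RandomForestClassifier(n_estimators=10, criterion="entropy", random_state=0)
clf.fit(X_train, y_train)                # train on executables with known categories
pred = clf.predict(X_test)               # predicted categories for unseen executables
prob = clf.predict_proba(X_test)[:, 1]   # malicious probability, usable for ROC analysis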
4.1 Implementation
4.2 Experiments
Measure              Description
True Positive (TP)   Number of files correctly classified as malicious
True Negative (TN)   Number of files correctly classified as benign
False Positive (FP)  Number of files mistakenly classified as malicious
False Negative (FN)  Number of files mistakenly classified as benign
Precision (PPV)      TP / (TP+FP)
TP Rate (TPR)        TP / (TP+FN)
FP Rate (FPR)        FP / (FP+TN)
Accuracy (ACY)       (TP+TN) / (TP+TN+FP+FN)
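For reference, a small sketch deriving the four ratio measures above from the confusion-matrix counts (treating malicious as the positive class):

def measures(tp, tn, fp, fn):
    """Compute PPV, TPR, FPR and ACY from the confusion-matrix counts."""
    return {
        "PPV": tp / (tp + fp),                   # precision
        "TPR": tp / (tp + fn),                   # true positive rate (recall)
        "FPR": fp / (fp + tn),                   # false positive rate
        "ACY": (tp + tn) / (tp + tn + fp + fn),  # accuracy
    }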
Cross Validation Experiments To evaluate the performance of our approach,
we conduct 10-fold cross validation [33] experiments: in each experiment, we
randomly split our dataset into 10 equal folds; then for each fold, we train the
classifier model based on the data from the other folds (i.e., the training set),
and test the model on this fold (i.e., the testing set). The learning methods we use in our experiments are listed as follows (a condensed setup sketch is given after the list):
– K-Nearest Neighbour (KNN): experiments are performed with k = 1, 3, 5, and 7.
– Naive Bayes (NB): three variants, namely Gaussian Naive Bayes, Multinomial Naive Bayes, and Bernoulli Naive Bayes, are used in our experiments.
– Decision Trees (DT): we use decision trees with two split criteria, namely “gini” for the Gini impurity and “entropy” for the information gain. Random Forest (RF) with 10 trees is used as well.
– Support Vector Machines (SVM): we perform our experiments with a linear kernel, a sigmoid kernel, and a Radial Basis Function (RBF) kernel, respectively. We also try linear models with stochastic gradient descent (SGD).
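A condensed sketch of this setup with scikit-learn [29] is given below; the hyper-parameters shown are the ones listed above, everything else is left at its default, and the toy X, y again stand in for our real feature matrix and labels.

import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
X, y = rng.integers(0, 50, size=(200, 30)), rng.integers(0, 2, size=200)

classifiers = {f"{k}-KNN": KNeighborsClassifier(n_neighbors=k) for k in (1, 3, 5, 7)}
classifiers.update({
    "Gaussian NB": GaussianNB(),
    "Multinomial NB": MultinomialNB(),
    "Bernoulli NB": BernoulliNB(),
    "DT (gini)": DecisionTreeClassifier(criterion="gini"),
    "DT (entropy)": DecisionTreeClassifier(criterion="entropy"),
    "RF (10 trees)": RandomForestClassifier(n_estimators=10, criterion="entropy"),
    "SVM (linear)": SVC(kernel="linear"),
    "SVM (sigmoid)": SVC(kernel="sigmoid"),
    "SVM (RBF)": SVC(kernel="rbf"),
    "SGD": SGDClassifier(),
})

# 10-fold cross validation: train on 9 folds, test on the held-out fold.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
for name, clf in classifiers.items():
    acc = cross_val_score(clf, X, y, cv=cv, scoring="accuracy")
    print(f"{name}: mean accuracy {acc.mean():.4f}")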
Table 2 shows the detailed results of the experiments and Figure 2 shows some selected Receiver Operating Characteristic (ROC) curves. All the KNN classifiers and DT classifiers perform quite well, with an accuracy greater than 97%, among which Random Forest with 10 entropy trees has produced the best result. Rather surprisingly, 1-KNN performs better than the other KNNs. All the NB classifiers perform somewhat worse, with accuracies from 66.38% to 82.79%. Most of the SVM classifiers achieve an accuracy greater than 95%, except for the one with the sigmoid kernel, whose accuracy is 85.80%. Concerning the ROC curves, most classifiers produce much better results than random classification, except for Gaussian Naive Bayes, whose Area Under the ROC curve (AUC) of 0.5931 is only a little better than 0.5. Random Forest with 10 entropy trees performs best here as well, with an AUC of 0.9959. One of the main reasons for Random Forest to outperform the others is that it is an ensemble learning method, which may correct the Decision Trees' habit of overfitting to the training set. In the future, some other ensemble learning methods will also be considered.
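The ROC points and AUC for any of these classifiers can be computed from its scores; a brief sketch, reusing prob and y_test from the Random Forest example in Section 3:

from sklearn.metrics import roc_curve, roc_auc_score

# prob: malicious probabilities from the classifier; y_test: actual categories.
fpr, tpr, _ = roc_curve(y_test, prob)
print(f"AUC = {roc_auc_score(y_test, prob):.4f}")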
We have also evaluated how the feature extractor performs. For that, we first measured the total time needed by the decompilation tool IDA Pro [28] to decompile the executables. Our data set contains 19379 executables, with a total size of 250GB. It takes 15.6 hours to decompile all the files, an average of 0.22s/MB. IDA Pro is a heavyweight tool, so using a lighter-weight one might improve this time. Secondly, we measured the total time needed to extract features from the decompiled asm files. The sizes of the asm files range from 142KB to 144.84MB, with a total size of 182GB. It takes 10.5 hours to extract the features from all the asm files, an average of 0.20s/MB. Both the decompilation time and the extraction time are considered acceptable.
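For reference, the per-megabyte averages can be obtained with a helper along these lines (a sketch; seconds_per_mb and the process callback are hypothetical names, not part of our tool):

import os
import time

def seconds_per_mb(paths, process):
    """Average processing time per MB over a batch of files (sketch)."""
    total_mb = sum(os.path.getsize(p) for p in paths) / 2**20
    start = time.perf_counter()
    for p in paths:
        process(p)   # e.g. decompile one executable, or extract its features
    return (time.perf_counter() - start) / total_mb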
Obfuscation Experiments Firstly, we use the tool Obfuscator [35] to obfuscate malware samples selected from our data set, yielding 50 new malware samples. Secondly, we use the open source tool Unest [36] to obfuscate 15 obj files, compiled from malware samples in C source code through VS 2010¹, by the following four techniques: (a) equivalently rewriting numeric operations, (b) confusing the output string, (c) pushing the target code segment onto the stack and jumping to it to confuse the target code, and (d) obfuscating the static libraries, yielding 60 new malware samples.
We used our classifier trained by Random Forest to detect the newly generated malware samples. The results are presented in Table 6 and show that all the obfuscated malware samples were detected by our classifier. This indicates that our classifier has some resistance to some obfuscation techniques. To change the code execution flow, Obfuscator inserts lots of jump instructions. The jump instructions may change the opcode feature, but the data feature and the library feature stay the same; indeed, jump is commonly used in both malware and benign software. Therefore, this technique does not change the features we use and thus cannot evade our detection. Similar to Obfuscator, the techniques used in Unest make little change to the features, and thus cannot evade our detection either. Nevertheless, the techniques used here are rather simple; the integration of more sophisticated techniques will be considered in the future.
¹ We are able to obfuscate only the obj files compiled from C code through VS 2010.
5 Conclusion
In this work, we have proposed a malware detection approach using various machine learning methods, based on the opcodes, data types and system libraries used in executables. To evaluate the proposed approach, we have carried out several experiments. Through these experiments, we have found that the classifier trained by Random Forest outperforms the others on our data set, and that all the features we have adopted are effective for malware detection. The experimental results have also demonstrated that our classifier is capable of detecting some fresh malware, and has a resistance to some obfuscation techniques.
As for future work, we may consider some other meaningful features, including static and dynamic ones, to improve the approach. We may also employ unsupervised machine learning methods or deep learning methods to train the classifiers. More experiments on malware anti-detection techniques are under consideration.
References
1. McAfee Labs Threats Report. June 2017.
2. Philippe Beaucamps and Eric Filiol. On the possibility of practically obfuscating
programs towards a unified perspective of code protection. Journal in Computer
Virology, 3(1):3–21, 2007.
3. Yanfang Ye, Tao Li, Donald Adjeroh, and S. Sitharama Iyengar. A survey on malware detection using data mining techniques. ACM Computing Surveys, 50(3):41, 2017.
4. Yuval Elovici, Asaf Shabtai, and Robert Moskovitch. Applying machine learning techniques for detection of malicious code in network traffic. In German Conference on Advances in Artificial Intelligence, pages 44–50, 2007.
5. Mohammad M Masud, Latifur Khan, and Bhavani Thuraisingham. A scalable
multi-level feature extraction technique to detect malicious executables. Informa-
tion Systems Frontiers, 10(1):33–45, 2008.
6. Blake Anderson, Curtis Storlie, and Terran Lane. Improving malware classifica-
tion:bridging the static/dynamic gap. In ACM Workshop on Security and Artificial
Intelligence, pages 3–14, 2012.
7. Yanfang Ye, Tao Li, Yong Chen, and Qingshan Jiang. Automatic malware cate-
gorization using cluster ensemble. In ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining, 2010.
8. Igor Santos, Felix Brezo, Xabier Ugarte-Pedrero, and Pablo G Bringas. Opcode se-
quences as representation of executables for data-mining-based unknown malware
detection. Information Sciences, 231(9):64–82, 2013.
9. Tzu Yen Wang, Shi Jinn Horng, Ming Yang Su, and Chin Hsiung Wu. A surveil-
lance spyware detection system based on data mining methods. In IEEE Interna-
tional Conference on Evolutionary Computation, pages 3236–3241, 2006.
10. Yanfang Ye, Dingding Wang, Tao Li, and Dongyi Ye. Imds: intelligent malware
detection system. In ACM SIGKDD International Conference on Knowledge Dis-
covery and Data Mining, pages 1043–1047, 2007.
11. Yanfang Ye, Tao Li, Kai Huang, Qingshan Jiang, and Yong Chen. Hierarchical
associative classifier (hac) for malware detection from the large and imbalanced
gray list. Journal of Intelligent Information Systems, 35(1):1–20, 2009.
12. Yanfang Ye, Lifei Chen, Dingding Wang, Tao Li, Qingshan Jiang, and Min Zhao.
Sbmds: an interpretable string based malware detection system using svm ensemble
with bagging. Journal in Computer Virology, 5(4):283, 2009.
13. Rafiqul Islam, Ronghua Tian, Lynn M. Batten, and Steve Versteeg. Classification of malware based on integrated static and dynamic features. Journal of Network and Computer Applications, 36(2):646–656, 2013.
14. Nikos Karampatziakis, Jack W. Stokes, Anil Thomas, and Mady Marinescu. Using
File Relationships in Malware Classification. Springer Berlin Heidelberg, 2013.
15. Acar Tamersoy, Kevin Roundy, and Duen Horng Chau. Guilt by association:
large scale malware detection by mining file-relation graphs. In ACM SIGKDD
International Conference on Knowledge Discovery and Data Mining, 2014.
16. Gamal Abdel Nassir Mohamed and Norafida Bte Ithnin. Survey on representation
techniques for malware detection system. American Journal of Applied Sciences,
2017.
17. Joshua Saxe and Konstantin Berlin. Deep neural network based malware detection
using two dimensional binary program features. In 2015 10th International Con-
ference on Malicious and Unwanted Software (MALWARE), pages 11–20, 2015.
18. William Hardy, Lingwei Chen, Shifu Hou, Yanfang Ye, and Xin Li. Dl4md: A
deep learning framework for intelligent malware detection. In Proceedings of the
International Conference on Data Mining, 2016.
19. Yanfang Ye, Lingwei Chen, Shifu Hou, William Hardy, and Xin Li. Deepam: a het-
erogeneous deep learning framework for intelligent malware detection. Knowledge
and Information Systems, pages 1–21, 2017.
20. Roberto Jordaney, Kumar Sharad, Santanu K. Dash, Zhi Wang, Davide Papini,
Ilia Nouretdinov, and Lorenzo Cavallaro. Transcend: Detecting concept drift in
malware classification models. In 26th USENIX Security Symposium (USENIX
Security 17), pages 625–642, 2017.
21. Stephen Robertson. Understanding inverse document frequency: on theoretical
arguments for idf. Journal of Documentation, 60(5):503–520, 2004.
22. Gregory Shakhnarovich, Trevor Darrell, and Piotr Indyk. Nearest-Neighbor Methods in Learning and Vision. MIT Press, 2006.
23. Irina Rish. An empirical study of the naive Bayes classifier. In IJCAI 2001 Workshop on Empirical Methods in Artificial Intelligence, 2001.
24. J. R. Quinlan. Induction of decision trees. Machine Learning, 1(1):81–106, 1986.
25. Andy Liaw and Matthew Wiener. Classification and regression by randomForest. R News, 2(3):18–22, 2002.
26. Nello Cristianini and John Shawe-Taylor. An Introduction to Support Vector Machines and Other Kernel-based Learning Methods. Cambridge University Press, 2000.
27. Fasihul Kabir, Sabbir Siddique, Mohammed Rokibul Alam Kotwal, and Mohammad Nurul Huda. Bangla text document categorization using stochastic gradient descent (SGD) classifier. In International Conference on Cognitive Computing and Information Processing, 2015.
28. IDA Pro. http://www.hex-rays.com/idapro/.
29. Oliver Kramer. Scikit-Learn. Springer International Publishing, 2016.
30. Zhiwu Xu, Cheng Wen, and Shengchao Qin. Learning Types for Binaries. In 19th
International Conference on Formal Engineering Methods, 2017.
31. Microsoft Malware Classification Challenge. https://www.kaggle.com/c/malware-classification.
32. theZoo aka Malware DB. http://ytisf.github.io/theZoo/.
33. Ron Kohavi. A study of cross-validation and bootstrap for accuracy estimation
and model selection. In International Joint Conference on Artificial Intelligence,
pages 1137–1143, 1995.
34. DAS MALWERK. http://dasmalwerk.eu/.
35. Obfuscator. https://www.pelock.com/products/obfuscator.
36. Unest. http://unest.org/.