Researchdemo 1
ARTICLE INFO

Keywords: Windows malware detection; Graph neural network; BERT-based embedding; Dynamic API call

ABSTRACT

Application Program Interface (API) calls are widely used in dynamic Windows malware analysis to characterize the run-time behavior of malware. Researchers have proposed various approaches to mine semantic information from API calls to improve the performance of malware analysis. However, with increasingly sophisticated malware, the exploration of new semantic dimensions for API calls is never-ending. In this paper, we find that the official Windows API documentation is an unexplored information source in malware detection. Therefore, we propose a novel documentation-augmented Windows malware detection framework DawnGNN using the pre-trained semantic enhanced mechanism and graph neural network. First, it converts the API sequences into API graphs for further contextual information extraction. Next, we crawl API documentation from the official website and employ the pre-trained Bidirectional Encoder Representations from Transformers (BERT) model to encode functionality descriptions as API embeddings. Finally, it feeds the API graphs with API node attributes into the Graph Attention Network (GAT) classifier to perform Windows malware detection. Moreover, we verify the effectiveness of DawnGNN on three public datasets. Experimental results demonstrate the effectiveness of DawnGNN. Semantic information from the official API documentation is promising in the Windows malware detection domain.
* Corresponding author.
E-mail addresses: pbfeng@xidian.edu.cn (P. Feng), xdutanzhe@gmail.com (L. Gai), yangli@xidian.edu.cn (L. Yang), qinwangtech@gmail.com (Q. Wang),
tengli@xidian.edu.cn (T. Li), nxi@xidian.edu.cn (N. Xi), jfma@mail.xidian.edu.cn (J. Ma).
https://doi.org/10.1016/j.cose.2024.103788
Received 11 September 2023; Received in revised form 24 January 2024; Accepted 23 February 2024
Available online 29 February 2024
0167-4048/© 2024 Elsevier Ltd. All rights reserved.
P. Feng, L. Gai, L. Yang et al. Computers & Security 140 (2024) 103788
graph neural network and the BERT model. Fig. 3 shows the overall ar-
chitecture of DawnGNN, which consists of three components, namely,
API Graph Constructor, API2Vec Embedding Layer and GNN Classifier.
Firstly, the API graph constructor leverages Cuckoo Sandbox [27] to
perform automatic dynamic analysis on Windows Portable Executable
(PE) programs to extract API call sequences. API calls, indicating the
interactions between programs and system resource usage, are widely
used to make unified behavior representations for malware detection.
Then, it adopts structural dependencies within API sequences to build
API graphs. Next, the API2Vec embedding layer generates API attributes
by encoding official API documentation via the BERT-based language
model. Finally, GNN is adopted to learn contextual information from
attributed API graphs for performing effective malware detection.
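The step of turning an API call sequence into an API graph can be illustrated with a minimal sketch. It assumes, purely for illustration, that each distinct API name becomes a node and that a directed edge links each call to the call that immediately follows it; the exact structural dependencies used by DawnGNN may differ. The function name `build_api_graph` and the sample sequence are hypothetical.

```python
from collections import Counter


def build_api_graph(api_sequence):
    """Build a directed API graph from one run-time API call sequence.

    Assumption (not taken from the paper): an edge (src, dst) is added
    whenever call `dst` immediately follows call `src` in the sequence.
    """
    # One node per distinct API name, in a deterministic order.
    nodes = sorted(set(api_sequence))
    index = {name: i for i, name in enumerate(nodes)}
    # Count how often each consecutive pair occurs; the count is the edge weight.
    edge_counts = Counter(
        (index[src], index[dst])
        for src, dst in zip(api_sequence, api_sequence[1:])
    )
    return nodes, dict(edge_counts)


# Hypothetical sequence for illustration only.
sequence = [
    "CreateToolhelp32Snapshot",
    "Process32FirstW",
    "Process32NextW",
    "Process32NextW",
    "CloseHandle",
]
nodes, edges = build_api_graph(sequence)
```

The node list and the edge dictionary can then be handed to a graph learning library as the graph structure, with one embedding vector attached per node.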
Fig. 2. Official API documentation for Windows API Process32NextW. Text information in the red box denotes the API functionality description.

3.1. API graph constructor
leverage the one-hot encoding and Word2vec static embedding mechanism to generate API node attributes directly from the collected API sequences. Next, we compare the BERT encoding mechanism with the above two mechanisms to highlight the importance of semantic information.

3.3. GNN classifier

After the processing of the API2Vec Embedding Layer component, we obtain a number of dynamic API graphs with corresponding node attributes. Then, the GAT [29] classifier is trained on these API graphs to extract the structural information and further perform Windows malware detection. GAT is a graph neural network based on an attention-based message-passing mechanism. This attention mechanism allows GAT to adaptively allocate attention weights to neighbor nodes. Next, it leverages the weighted sum of neighbor nodes to update the representation of the current node. In addition, GAT has the advantage of strong generalization for directed graphs.

The network structure of the GAT is shown in Fig. 8. As shown in Fig. 8, at first, each API's official documentation within the API graph is fed into BERT to extract semantic embedding. Then, the API node embeddings and API graph structure are used as the input of GAT to compute graph embedding with structure and semantic information. During the iterative process of every layer within GAT, the semantic embedding of an API node is passed to its neighbor nodes. With the help of the multi-head attention mechanism, each API node can focus
on more critical neighbor nodes. Given two connected API nodes $i$ and $j$, the attention weight $\alpha$ at attention head $t$ and the $l$-th layer is calculated according to the following formula:

$$\alpha_{i,j}^{t} = \frac{\exp(\mathrm{ReLU}(F_a[W_t h_i^l \,\|\, W_t h_j^l]))}{\sum_{c \in R_i} \exp(\mathrm{ReLU}(F_a[W_t h_i^l \,\|\, W_t h_c^l]))}, \tag{2}$$

where $h^l$ denotes the hidden representation of an API node at the $l$-th layer, $h^0$ equals the semantic embedding of an API node generated by the BERT model, $W_t$ represents the learnable parameters at the $t$-th attention head, $F_a$ is a feedforward neural network, ReLU indicates the rectifier activation function, $\|$ denotes the concatenation operation, and $R_i$ represents the neighbor nodes of API node $i$. Then, the update process of each API node's embedding based on the attention mechanism is formalized as follows:

$$h_i^{l+1} = \big\Vert_{t=1}^{T} \, \sigma\Big(\sum_{j \in R_i} \alpha_{i,j}^{t} W_t h_j^l\Big), \tag{3}$$

where $T$ denotes the number of attention heads and $t$ indexes the $t$-th attention head. Finally, GAT updates the API node embeddings of the API graph and sums them into the graph semantic embedding $s_G$ as follows:

$$s_G = \sum_{i=0}^{N_G} h_i^L. \tag{4}$$

The final prediction classification is performed via a Multilayer Perceptron (MLP) model, which can be represented as:

$$Y = \mathrm{MLP}(s_G \mid \theta_{mlp}), \tag{5}$$

where $\theta_{mlp}$ denotes the learnable parameters of the MLP model, and $Y$ represents the final classification label, malware or benign.

In this paper, we leverage the GNN model to generate graph embedding [30] via encoding all node hidden representations and graph structure information into low-dimensional space. The node hidden representation is obtained by aggregating local neighbor node information. DawnGNN also adopts Graph Convolutional Network (GCN) and Graph Isomorphism Network (GIN), and compares their performance to identify the most effective mechanism in malware detection. GCN is another representative GNN, where the node hidden representation is calculated via the following formula:

$$H^{l+1} = \mathrm{ReLU}(\hat{A} H^l W^l), \tag{6}$$

where $H^l$ indicates the hidden representation matrix at the $l$-th layer for all nodes, and $H^0$ denotes all the API node embeddings generated by the API2Vec Embedding Layer component. $W^l$ denotes the learnable weight parameters of the $l$-th GCN layer. $\hat{A} = \tilde{M}^{-\frac{1}{2}} \tilde{A} \tilde{M}^{-\frac{1}{2}}$, where $\tilde{M}$ denotes the degree matrix, and $\tilde{A} = A + I_s$. $I_s$ is the identity matrix. GIN adopts an MLP model to aggregate comprehensive information, which is formalized as:

$$h_i^{l+1} = \mathrm{MLP}^{l+1}\Big((1 + \epsilon^l) h_i^l + \sum_{j \in R(i)} h_j^l\Big), \tag{7}$$

where $\epsilon^l$ represents scalar learnable parameters.

4. Experiments and evaluation

In this section, we comprehensively evaluate our proposed system DawnGNN via various experiments. In the following, we first describe the experiment settings and datasets used in DawnGNN. Then, we discuss the results of our experiments.

4.1. Experimental setup and dataset

The proposed framework DawnGNN was implemented and tested on a computer running Ubuntu 20.04 (64-bit) with an Intel(R) Core (TM) i7-12700 CPU @ 2.10 GHz, 16.0 GB RAM, an NVIDIA RTX 3060, and a 512 GB hard disk drive. DawnGNN was implemented in Python 3.8.10 with PyTorch 2.0.0 and the Transformers 4.28.1 framework; other libraries such as Scikit-learn, Numpy, Pandas, and Requests were also used. The framework takes sequences of API calls extracted from Windows exe files as input.

We collected three existing datasets of malicious and normal API calls for our experimental evaluation. The information of these datasets is summarized in Table I. As described in Section 3.1, we generate API graphs based on collected run-time API call sequences. The statistics of generated API graphs on three datasets are shown in Table II. From Table II, we can observe that the API graphs are sparse, which is suitable for graph classification tasks. Using different datasets allows us to evaluate the malware detection performance of DawnGNN from multiple dimensions. Specifically, we randomly shuffle each dataset and split 80% for training, 10% for validation, and the remaining 10% for testing.

Table I
Three datasets of API calls for experimental evaluation.
Dataset             # of Malicious Samples   # of Benign Samples   Released Date
MalBehavD-V1 [26]   1,285                    1,285                 2022
PE_APICALLS [31]    452                      101                   2019
APIMDS [32]         23,080                   300                   2015

Table II
Statistics of generated API graphs on three datasets.
Dataset        Label     Avg. # of Nodes   Avg. # of Edges
MalBehavD-V1   malware   41.54             34.61
MalBehavD-V1   benign    42.61             31.22
PE_APICALLS    malware   37.87             29.19
PE_APICALLS    benign    19.64             28.10
APIMDS         malware   108.37            30.27
APIMDS         benign    42.68             31.68

4.2. Measure metrics

We evaluate the Windows malware detection performance of DawnGNN with the following five metrics: precision, recall, true negative rate, accuracy, and F1-score. These metrics are computed from the counts of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). In the Windows malware detection scene, TP denotes the count of correctly identified malicious exe files, and TN denotes the count of correctly detected benign exe files. FP indicates the count of benign exe files misidentified as malicious, and FN denotes the count of missed malicious exe files. The above measure metrics are calculated as follows:

$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{8}$$

$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{9}$$

$$\text{True negative rate (TNR)} = \frac{TN}{TN + FP} \tag{10}$$

$$\text{Accuracy (Acc)} = \frac{TP + TN}{TP + TN + FP + FN} \tag{11}$$

$$\text{F1-score (F1)} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{12}$$

4.3. Performance of malware detection

In this paper, DawnGNN performs Windows malware detection via graph neural network and BERT-based semantic enhanced mechanism. Therefore, in this experiment, we comprehensively evaluate the effectiveness of the two main components.
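The five metrics of Section 4.2, which are reported throughout this evaluation, follow directly from the four confusion-matrix counts. The snippet below is a minimal sketch of Eqs. (8)-(12); the function name `detection_metrics`, the zero-denominator convention, and the example counts are assumptions made for illustration and are not results from the paper.

```python
def detection_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Compute precision, recall, TNR, accuracy, and F1 from raw counts."""

    def ratio(numerator: float, denominator: float) -> float:
        # Assumption: return 0.0 instead of raising when a denominator is zero.
        return numerator / denominator if denominator else 0.0

    precision = ratio(tp, tp + fp)                  # Eq. (8)
    recall = ratio(tp, tp + fn)                     # Eq. (9)
    tnr = ratio(tn, tn + fp)                        # Eq. (10)
    accuracy = ratio(tp + tn, tp + tn + fp + fn)    # Eq. (11)
    f1 = ratio(2 * precision * recall, precision + recall)  # Eq. (12)
    return {
        "precision": precision,
        "recall": recall,
        "tnr": tnr,
        "accuracy": accuracy,
        "f1": f1,
    }


# Hypothetical counts for illustration only (not taken from any dataset).
example = detection_metrics(tp=90, tn=80, fp=20, fn=10)
```

With heavily imbalanced data, TNR is the metric most sensitive to a small number of benign samples, since its denominator contains only benign files.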
Table III
Detection performance comparison with different detection methods in dataset MalBehavD-V1.
Detection method     Precision   Recall   TNR      Acc      F1
one-hot + RF         0.9009      0.9494   0.8908   0.9203   0.9231
one-hot + LSTM       0.9150      0.9469   0.9106   0.9283   0.9295
one-hot + GAT        0.9195      0.9586   0.9074   0.9339   0.9368
Word2Vec + LSTM      0.9323      0.9661   0.9277   0.9441   0.9475
Word2Vec + GAT       0.9559      0.9532   0.9436   0.9432   0.9543
BERTsmall + LSTM     0.9609      0.9584   0.9512   0.9508   0.9595
BERTsmall + GAT      0.9667      0.9756   0.9527   0.9607   0.9683
BERTbase + LSTM      0.9632      0.9615   0.9553   0.9535   0.9618
BERTbase + GAT       0.9697      0.9788   0.9556   0.9638   0.9711
BERTsmall and BERTbase denote the small and base versions of the pre-trained BERT encoding mechanism.

Firstly, we compare the BERT-based encoding mechanism with one-hot and Word2Vec-based encoding methods to highlight the effectiveness of semantic information extracted from API documentation. In the one-hot encoding method, each API call is transformed into a binary vector where each position represents a unique API call. The dimension of the binary vector is equal to the count of all collected APIs. In the Word2Vec-based encoding method, we treat each API call as a word and use a neural network model to learn word associations from a large set of API sequences.

Next, we compare the graph feature-based detection method with the statistical and sequence feature-based detection methods to illustrate the effectiveness of the design of graph feature learning. We combine the one-hot vector representation of API call sequences with the Random Forest (RF) model as the statistical feature-based detection method. We leverage the long short-term memory (LSTM) model to handle API sequences, which constitutes the typical sequence feature-based detection method. In particular, the LSTM model requires an API encoding mechanism before inputting API sequences, which could be any one of the three encoding mechanisms.

Specifically, we compare the detection methods mentioned above on the balanced dataset MalBehavD-V1. We use the Word2Vec and RF algorithms with their default settings. For the BERT-based encoding mechanism, we select the base and small versions according to our experiment environment. Specifically, we perform the MLM task on the collected official functionality descriptions with bert_base_uncased and bert_small as the initial pre-trained models and extract the last hidden layer as the API embedding. The dimensions of the generated API embeddings are 768 and 512, respectively. As there are multiple versions of the BERT model and many variant models, such as RoBERTa [33] and SENTENCE-BERT [34], we leave the exploration of the optimal encoding mechanism as future work. For LSTM, we refer to the parameter settings within the existing detection method [35].

The Windows malware detection performance with different encoding mechanisms and feature structures is shown in Table III. From Table III, we can observe that our BERT-based semantic enhanced mechanism improves the malware detection performance under every type of learning model. This illustrates that API documentation contains rich semantic information for identifying Windows malware. The base version of BERT outperforms the small version, which illustrates that larger API embeddings carry more precise semantic information in our API documentation encoding case. In addition, the pre-trained BERT model has the ability to extract context-sensitive information from API documentation. From Table III, we can also observe that the graph feature learning method is superior to the statistical and sequence feature-based methods, in each of the three encoding mechanisms. This illustrates that the graph feature-based method considers the structure information, improving the detection performance compared to the method that only considers sequence or statistical information.

We also evaluate the effectiveness of DawnGNN with different encoding mechanisms on datasets PE_APICALLS and APIMDS. The Windows malware detection performance on the two datasets is shown in Table IV. From Table IV, we can observe that our BERT-based encoding mechanism improves the malware detection performance on the two imbalanced datasets via semantic information extracted from API documentation. On the two datasets, the TNR is lower than Precision and Recall. This is because malware samples far outnumber benign samples, and the model cannot characterize benign patterns without enough samples.

4.4. Comparison of GNN algorithms

In this section, we explore the influence of multiple GNN algorithms on Windows malware detection performance.

We tune the hyper-parameters that significantly affect the detection performance according to domain knowledge to obtain the optimal detection performance. The experiment is conducted on the dataset MalBehavD-V1 to eliminate the interference of imbalanced samples. The search range of hyper-parameters and optimal values for three GNN models are shown in Table V. From Table V, we can observe that the three GNN algorithms achieve the best performance at 100 epochs with a batch size of 128 and a learning rate of 0.0001. All three models use multiple graph neural network layers when achieving their best performance. The optimal hidden dimensions for the three models are 32 for GCN, 16 for GIN, and 12 for GAT.

We compare the detection performance of DawnGNN by using three representative GNN algorithms: GCN, GIN, and GAT. The detection performance comparison of three GNN algorithms is shown in Table VI. From Table VI, we can observe that GAT provides the best detection performance. The experiment results show that DawnGNN with GIN exhibits superior detection performance over GCN, which illustrates that adopting the powerful MLP message aggregation function leads to the improvement of GIN when compared with GCN. The GAT algorithm employs an attention mechanism to adaptively aggregate important information from neighbor nodes. In addition, the ability to handle directed graphs makes GAT more suitable for API graph scenes. Therefore, DawnGNN with GAT achieves the best performance when compared with GIN and GCN. We also plot how the detection performance varies with epochs in Fig. 9 to observe fluctuations in detection performance. From Fig. 9, we can observe that our proposed Windows malware detection framework achieves good performance on all three GNN algorithms. This illustrates that the combination of GNN models and BERT-based enhancing mechanisms is promising in Windows malware detection.

4.5. Comparison with other approaches

In the following, we compare DawnGNN with existing Windows malware detection approaches to verify the effectiveness of our proposed framework on public datasets.

We examine the performance of the DawnGNN framework against other existing detection approaches based on API call sequences extracted from exe files, and comparative results are presented in Table VII. First, we compare DawnGNN against MalDy [36] and MalDetConv [26] on datasets MalBehavD-V1 and PE_APICALLS. MalDy proposed to leverage n-grams, feature hashing, and Term Frequency–Inverse Document Frequency (TF-IDF) to vectorize the behavior reports. Then, an ensemble prediction framework is constructed to perform precise malware detection. MalDetConv designed a new automated behavior-based detection framework, which constructs a hybrid of Convolutional Neural Network (CNN) and Bidirectional Gated Recurrent Unit (BiGRU) models to learn high-dimensional representations of API call sequences and then leverages a fully connected neural network module for malware detection. On dataset MalBehavD-V1, DawnGNN achieved a detection accuracy of 0.9638, an improvement of 0.79% and 0.51% over the detection accuracy of MalDy and MalDetConv,
Table IV
Detection performance comparison with different encoding mechanisms in
dataset PE_APICALLS and APIMDS.
Table V
Search range of hyper-parameters and optimal values for three GNN models.
Table VII
Detection performance comparison with existing approaches.
Dataset        Approach               Behavior Feature                           Feature Vectorization Method          ML/DL Algorithm           Detection Accuracy
MalBehavD-V1   MalDy [36]             Behavior reports from analysis sandbox     N-grams + Feature Hashing + TF-IDF    Ensemble learning         0.9559
MalBehavD-V1   MalDetConv [26]        API call sequences                         Word2vec                              Hybrid model CNN-BiGRU    0.9610
MalBehavD-V1   DawnGNN                API call sequences + API documentation     BERT-based semantic enhancement       GAT                       0.9638
PE_APICALLS    MalDy [36]             Behavior reports from analysis sandbox     N-grams + Feature Hashing + TF-IDF    Ensemble learning         0.9518
PE_APICALLS    MalDetConv [26]        API call sequences                         Word2vec                              Hybrid model CNN-BiGRU    0.9573
PE_APICALLS    DawnGNN                API call sequences + API documentation     BERT-based semantic enhancement       GAT                       0.9762
APIMDS         Amer and Zelinka [7]   API call sequences                         Word2Vec + clustering similarity      Markov chain model        0.9990
APIMDS         Ki et al. [32]         API call sequences                         DNA sequence alignment                Similarity matching       0.9980
APIMDS         Tran and Sato [37]     API call sequences                         TF-IDF                                SVM                       0.9619
APIMDS         MalDetConv [26]        API call sequences                         Word2vec                              Hybrid model CNN-BiGRU    0.9993
APIMDS         DawnGNN                API call sequences + API documentation     BERT-based semantic enhancement       GAT                       0.9975
tion approaches. Mal-Bert-GCN [18] built directed process graphs from raw API sequences and employed GCN to perform malware detection. Similarly, our BERT-based semantic enhanced mechanism is orthogonal to inter-process interaction information. In this paper, we only use the functional description of an API call to extract semantic information. The many other descriptions, including parameters, return value, remarks, and requirements, are potentially useful information sources for malware detection when considering the boom in large language models [45]. We leave the design of a more comprehensive detection framework that digs deeper into the semantic information of API documentation as future work.

Our documentation-augmented malware detection framework is based on the classic BERT-based encoding mechanism and standard GNN algorithm. The encoding mechanism could be improved by the large version or other optimized models, such as RoBERTa, ALBERT [46], DistilBERT [47], etc. The standard GNN algorithm could be improved by jumping knowledge networks [48], self-supervised learning mechanisms [49], or other superior graph learning algorithms [50]. We leave the exploration of the optimal encoding mechanisms and graph learning models as future work for performing high-precision malware detection. We verify the effectiveness of the proposed semantic enhanced mechanism on three publicly available dynamic malware datasets. We plan to conduct stronger experimental validation when more recent and diverse datasets become available. The semantic enhanced mechanism could complement and enhance current Windows malware detection methods. We leave the enhancement validation as future research with open source or re-implemented versions of current detection methods. In addition, the combined design of graph neural network and BERT-based semantic enhanced mechanism can also be applied to malware detection on other platforms, like Android and Linux. The design of bringing in external software operation knowledge is promising in the malware detection domain.

6. Related work

This section provides a detailed discussion of relevant literature on dynamic Windows malware detection and BERT-based security detection methods.

6.1. Dynamic windows malware detection

The dynamic method executes a program in a controlled environment which observes the execution status of the program and then determines whether it has malicious behaviors by examining API calls, network traffic, and other critical characteristics. API calls can provide valuable run-time information for identifying malicious activities. Therefore, researchers have proposed many API call-based detection approaches.

Researchers have long focused on extracting more effective features from API call sequences to perform malware detection. Fang et al. [51] use a hash function to encode the API call names, return values, and module names for obtaining more detailed behavior information. Agrawal et al. [52] perform one-hot encoding on API call sequences and n-gram encoding on API parameters. Unfortunately, these approaches only consider partial parameters or treat all parameters as strings, which cannot fully explore the information from various parameters. Zhang et al. [10] employ different hashing strategies to encode API names and various parameters, which still cannot express semantic information. Rabadi and Teo [53] divide the API parameters into multiple representation sets before applying feature hashing, which may lead to the loss of distinction.

Many studies have applied ML and DL models to analyze API call sequences. Qiao et al. [54] leverage frequent itemset mining and similarity calculation to process the API names and parameters within API call sequences. Uppal et al. [8] select the important features via frequency statistics from API call sequences, and then employ the Support Vector Machine (SVM) classifier to perform malware detection. The above methods ignore the relationship between API calls and can be easily evaded by modifying the frequency counter value. Daht et al. [55] employ n-grams to process the system API call sequences and then leverage the logistic regression and shallow neural network classifier to perform malware classification. Ndibanje et al. [56] construct feature vectors from API sequences and employ similarity-based statistics methods to detect malware. Zhang et al. [57] construct API relationship graphs to represent the internal relationships among various programs. Then, they leverage the knowledge graph embedding algorithm to input the API graphs into RF, Model Pool, SVM, and Deep Neural Network (DNN) to perform malware detection. Pascanu et al. [58] employ Recurrent Neural Networks (RNNs) to capture sequence relations between APIs and feed the outputs of RNNs into a max-pooling layer for malware classification. Kolosnjaji et al. [9] leverage CNN to process consecutive API sequences and apply LSTM to handle time-series dependence. Agrawal et al. [52] propose to construct several stacked LSTMs to process API names and string parameters. Zhang et al. [10] design a hybrid deep learning framework including gate-CNNs and Bi-LSTM to process API
names and parameters for performing malware detection. Researchers Declaration of competing interest
try to transform the API call sequences into graphs to capture the direct
or indirect relationship between API calls. Jiang et al. [59] transform The authors declare that there are no conflicts of interest regarding
the API sequences of exe files into a call graph by matching the caller- the publication of this article. All authors have contributed to, read, and
callee relationships. Then, graph embedding techniques and stacked approved this submitted manuscript in its current form.
denoising autoencoders are combined to perform malware detection.
Amer and Zelinka [7] leverage contextual similarity in API sequences Data availability
to cluster APIs, and employ the Markov chain to capture relationships
and perform malware detection. In this paper, we combine the GNN Data will be made available on request.
model and BERT-based semantic enhancement mechanism to classify
API graphs and perform effective Windows malware detection. Acknowledgements
6.2. BERT-based security detection This research was funded by the Major Research plan of the Na-
tional Natural Science Foundation of China (Grant No. 92267204), the
BERT has emerged as a powerful natural language processing model Natural Science Basic Research Program of Shaanxi (Program No. 2023-
that is capable of learning bidirectionally contextual representations. JC-QN-0759, 2022JM-338), and the Fundamental Research Funds for
With great success in various language tasks, the BERT model has been the Central Universities (Project No.: XJSJ23184).
applied in security detection domains, such as malware detection [60],
vulnerability detection [61], and malicious traffic detection [62], etc. References
MalBert [63] and Badr et al. [60] design a BERT-based framework to
perform Android malware detection and classification from elements [1] SonicWall, Mid-year update to the 2023 sonicwall cyber threat report, https://www.
sonicwall.com/2023-mid-year-cyber-threat-report/. (Accessed 23 January 2024).
extracted from Android Manifest file. SmartConDetect [61] proposes to
[2] J. Singh, J. Singh, Detection of malicious software by analyzing the behavioral arti-
extract code fragments via a static analysis tool and then feed them facts using machine learning algorithms, Inf. Softw. Technol. 121 (2020) 106273.
into a pre-trained BERT model to perform vulnerability detection in [3] Z. Sun, Z. Rao, J. Chen, R. Xu, D. He, H. Yang, J. Liu, An opcode sequences analysis
smart contracts. BINSHOT [64] designs a BERT-based similarity learn- method for unknown malware detection, in: Proceedings of the 2019 2nd Interna-
ing architecture to perform effectively binary code similarity detection. tional Conference on Geoinformatics and Data Analysis, 2019, pp. 15–19.
[4] D. Yuxin, Z. Siyi, Malware detection based on deep learning algorithm, Neural Com-
The architecture adopts a weighted distance vector with a binary cross put. Appl. 31 (2019) 461–472.
entropy as a loss function. ET-BERT [62] proposes a new BERT-based [5] W. Han, J. Xue, Y. Wang, L. Huang, Z. Kong, L. Mao, Maldae: detecting and explain-
encrypted traffic representation model, which could capture deep con- ing malware based on correlation and fusion of static and dynamic characteristics,
textualized datagram-level representation from large-scale unlabeled Comput. Secur. 83 (2019) 208–233.
[6] Z. Salehi, A. Sami, M. Ghiasi, Maar: robust features to detect malicious activity based
data and perform effective malicious traffic detection. Enimanal [65]
on api calls, their arguments and return values, Eng. Appl. Artif. Intell. 59 (2017)
proposes a specialized BERT model, which leverages the declarations 93–102.
within the Linux manual to extract semantic information for each sys- [7] E. Amer, I. Zelinka, A dynamic windows malware detection and prediction method
tem call, and then utilizes GNNs to perform cross-architecture IoT based on contextual understanding of api call sequence, Comput. Secur. 92 (2020)
malware analysis. CoDOC [22] proposes a fusion framework to accu- 101760.
[8] D. Uppal, R. Sinha, V. Mehra, V. Jain, Malware detection and classification based
rately identify sensitive Android source and sink methods. This frame-
on extraction of api sequences, in: 2014 International Conference on Advances in
work leverages graph learning to encode source code information and Computing, Communications and Informatics (ICACCI), IEEE, 2014, pp. 2337–2342.
a BERT-based model to extract semantic information from Android [9] B. Kolosnjaji, A. Zarras, G. Webster, C. Eckert, Deep learning for classification of
documentation. Similarly, in this paper, DawnGNN crawls the official malware system call sequences, in: AI 2016: Advances in Artificial Intelligence:
29th Australasian Joint Conference, Hobart, TAS, Australia, December 5-8, 2016,
Windows API documentation and leverages a BERT-based semantic ex-
Proceedings, vol. 29, Springer, 2016, pp. 137–149.
traction mechanism to enhance malware detection. [10] Z. Zhang, P. Qi, W. Wang, Dynamic malware analysis with feature engineering and
feature learning, Proc. AAAI Conf. Artif. Intell. 34 (01) (2020) 1210–1217.
7. Conclusion
In this paper, we propose DawnGNN, a novel dynamic Windows malware
detection system that combines graph neural networks with a BERT-based
semantic enhancement mechanism. It constructs API graphs directly from
API call sequences and leverages a BERT-based model to extract API
semantic information from the official API documentation. By feeding the
semantic information of API nodes and the directed API graphs into the
GAT, DawnGNN performs effective Windows malware detection. On three
public datasets, we verify that our BERT-based encoding mechanism
improves detection performance compared with one-hot and Word2Vec-based
encoding mechanisms, and that DawnGNN outperforms traditional detection
methods that use only raw API call sequences. In addition, we find that
the official API documentation is an unexplored, informative source and
that the BERT-based documentation-augmented mechanism is promising for
Windows malware detection.
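As an illustration of the graph construction step summarized above, the following minimal Python sketch converts an API call sequence into a directed graph and attaches one feature vector to each node. It assumes that an edge links each call to the call that immediately follows it and that repeated APIs collapse into a single node. The function names, the example trace, and the two-dimensional toy vectors standing in for BERT embeddings are hypothetical and are not taken from the DawnGNN implementation.

```python
def build_api_graph(api_sequence):
    """Turn an ordered API call sequence into a directed graph.

    Nodes are the distinct API names. An edge (u, v) is added whenever
    call v immediately follows call u (an assumption of this sketch).
    """
    # Assign node indices in order of first appearance.
    node_index = {}
    for api in api_sequence:
        if api not in node_index:
            node_index[api] = len(node_index)

    # Collect unique directed edges between consecutive calls.
    edges = set()
    for src, dst in zip(api_sequence, api_sequence[1:]):
        edges.add((node_index[src], node_index[dst]))

    return list(node_index), sorted(edges)


def node_features(nodes, doc_embeddings, dim):
    """Look up one feature vector per node; unknown APIs get zeros."""
    return [doc_embeddings.get(api, [0.0] * dim) for api in nodes]


# Hypothetical trace and toy 2-dimensional vectors, for illustration only.
trace = ["CreateFileW", "WriteFile", "CloseHandle", "CreateFileW", "WriteFile"]
toy_embeddings = {"CreateFileW": [0.1, 0.9], "WriteFile": [0.4, 0.2]}

nodes, edges = build_api_graph(trace)
features = node_features(nodes, toy_embeddings, dim=2)
```

For this trace, the sketch yields three nodes and the edges (0, 1), (1, 2), and (2, 0). CloseHandle has no toy vector and therefore receives a zero vector. In a full pipeline, the node list, edge list, and feature matrix would be passed to a GAT classifier.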
CRediT authorship contribution statement
Pengbin Feng: Methodology, Software, Writing – original draft,
Investigation. Le Gai: Software, Visualization, Writing – original draft.
Li Yang: Data curation, Software. Qin Wang: Methodology, Writing –
original draft, Software. Teng Li: Methodology, Writing – review &
editing. Ning Xi: Supervision, Writing – review & editing. Jianfeng
Ma: Funding acquisition, Supervision, Writing – review & editing.
[22] J. Samhi, M. Kober, A.K. Kabore, S. Arzt, T.F. Bissyandé, J. Klein, Negative results of [52] R. Agrawal, J.W. Stokes, M. Marinescu, K. Selvaraj, Neural sequential malware
fusing code and documentation for learning to accurately identify sensitive source detection with parameters, in: 2018 IEEE International Conference on Acoustics,
and sink methods: an application to the Android framework for data leak detec- Speech and Signal Processing (ICASSP), IEEE, 2018, pp. 2656–2660.
tion, in: 2023 IEEE International Conference on Software Analysis, Evolution and [53] D. Rabadi, S.G. Teo, Advanced windows methods on malware detection and classi-
Reengineering (SANER), IEEE, 2023, pp. 783–794. fication, in: Annual Computer Security Applications Conference, 2020, pp. 54–68.
[23] statcounter GlobalStats, Desktop operating system market share worldwide, https:// [54] Y. Qiao, Y. Yang, L. Ji, J. He, Analyzing malware by abstracting the frequent itemsets
gs.statcounter.com/os-market-share/desktop/worldwide. (Accessed 1 September in api call sequences, in: 2013 12th IEEE International Conference on Trust, Security
2023). and Privacy in Computing and Communications, IEEE, 2013, pp. 265–270.
[24] G.H. Andreas Marx, Oliver Marx, Total amount of malware and pua, https://portal. [55] G.E. Dahl, J.W. Stokes, L. Deng, D. Yu, Large-scale malware classification using
av-atlas.org/malware. (Accessed 1 August 2023). random projections and neural networks, in: 2013 IEEE International Conference on
[25] D. Uppal, R. Sinha, V. Mehra, V. Jain, Exploring behavioral aspects of api Acoustics, Speech and Signal Processing, IEEE, 2013, pp. 3422–3426.
calls for malware identification and categorization, in: 2014 International Con- [56] B. Ndibanje, K.H. Kim, Y.J. Kang, H.H. Kim, T.Y. Kim, H.J. Lee, Cross-method-based
ference on Computational Intelligence and Communication Networks, IEEE, 2014, analysis and classification of malicious behavior by api calls extraction, Appl. Sci.
pp. 824–828. 9 (2) (2019) 239.
[26] P. Maniriho, A.N. Mahmood, M.J.M. Chowdhury, Api-maldetect: automated mal- [57] X. Zhang, Y. Zhang, M. Zhong, D. Ding, Y. Cao, Y. Zhang, M. Zhang, M. Yang,
ware detection framework for windows based on api calls and deep learning tech- Enhancing state-of-the-art classifiers with api semantics to detect evolved Android
niques, J. Netw. Comput. Appl. (2023) 103704. malware, in: Proceedings of the 2020 ACM SIGSAC Conference on Computer and
[27] C. Guarnieri, Automated malware analysis, https://cuckoosandbox.org/. (Ac- Communications Security, 2020, pp. 757–770.
cessed 27 August 2021). [58] R. Pascanu, J.W. Stokes, H. Sanossian, M. Marinescu, A. Thomas, Malware classifica-
[28] F.O. Catak, Windows malware dataset with pe api calls, https://github.com/ocatak/ tion with recurrent networks, in: 2015 IEEE International Conference on Acoustics,
malware_api_class. (Accessed 27 August 2021). Speech and Signal Processing (ICASSP), IEEE, 2015, pp. 1916–1920.
[29] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio, Y. Bengio, Graph atten- [59] H. Jiang, T. Turki, J.T. Wang, Dlgraph: malware detection using deep learning and
tion networks, arXiv preprint arXiv:1710.10903, 2017. graph embedding, in: 2018 17th IEEE International Conference on Machine Learning
[30] H. Cai, V.W. Zheng, K.C.-C. Chang, A comprehensive survey of graph embedding: and Applications (ICMLA), IEEE, 2018, pp. 1029–1033.
problems, techniques, and applications, IEEE Trans. Knowl. Data Eng. 30 (9) (2018) [60] B. Souani, A. Khanfir, A. Bartel, K. Allix, Y. Le Traon, Android malware detection
1616–1637. using bert, in: International Conference on Applied Cryptography and Network Se-
[31] N. Allan, J. Ngubiri, Windows pe api calls for malicious and benigin programs, curity, Springer, 2022, pp. 575–591.
09.2019. [61] S. Jeon, G. Lee, H. Kim, S.S. Woo, Smartcondetect: highly accurate smart contract
[32] Y. Ki, E. Kim, H.K. Kim, A novel approach to detect malware based on api call code vulnerability detection mechanism using bert, in: KDD Workshop on Program-
sequence analysis, Int. J. Distrib. Sens. Netw. 11 (6) (2015) 659101. ming Language Processing, 2021.
[33] Z. Liu, W. Lin, Y. Shi, J. Zhao, A robustly optimized bert pre-training approach with [62] X. Lin, G. Xiong, G. Gou, Z. Li, J. Shi, J. Yu, Et-bert: a contextualized datagram
post-training, in: China National Conference on Chinese Computational Linguistics, representation with pre-training transformers for encrypted traffic classification, in:
Springer, 2021, pp. 471–484. Proceedings of the ACM Web Conference 2022, 2022, pp. 633–642.
[34] N. Reimers, I. Gurevych, Sentence-bert: sentence embeddings using Siamese bert- [63] A. Rahali, M.A. Akhloufi, Malbert: malware detection using bidirectional encoder
networks, arXiv preprint arXiv:1908.10084, 2019. representations from transformers, in: 2021 IEEE International Conference on Sys-
[35] M. Ring, D. Schlör, S. Wunderlich, D. Landes, A. Hotho, Malware detection on win- tems, Man, and Cybernetics (SMC), IEEE, 2021, pp. 3226–3231.
dows audit logs using lstms, Comput. Secur. 109 (2021) 102389. [64] S. Ahn, S. Ahn, H. Koo, Y. Paek, Practical binary code similarity detection with bert-
[36] E.B. Karbab, M. Debbabi, Maldy: portable, data-driven malware detection using nat- based transferable similarity learning, in: Proceedings of the 38th Annual Computer
ural language processing and machine learning techniques on behavioral analysis Security Applications Conference, 2022, pp. 361–374.
reports, Digit. Investig. 28 (2019) S77–S87. [65] L. Deng, H. Wen, M. Xin, H. Li, Z. Pan, L. Sun, Enimanal: augmented cross-
[37] T.K. Tran, H. Sato, Nlp-based approaches for malware classification from api se- architecture iot malware analysis using graph neural networks, Comput. Secur.
quences, in: 2017 21st Asia Pacific Symposium on Intelligent and Evolutionary (2023) 103323.
Systems (IES), IEEE, 2017, pp. 101–105.
Pengbin Feng received the Ph.D. degree in computer science from Xidian
University, Xi’an, Shaanxi, China, in 2019. He is currently a lecturer in
the School of Cyber Engineering, Xidian University. His research
interests include malware detection and binary analysis.
Le Gai is currently a BSc student with the School of Computer Science
and Technology at Xidian University, China. His current research
interests include learning-based security detection and privacy
protection.
Li Yang received the Ph.D. degree in computer science from Xidian
University, Xi’an, Shaanxi, China, in 2010. He is currently a Professor
with the School of Computer Science & Technology, Xidian University. His
current research interests include wireless network and system security.
Qin Wang received the Ph.D. degree from the University of New South
Wales in 2022. He is now an assistant researcher at the University of New
South Wales. His research interests include privacy protection and
blockchain.
Teng Li received the Ph.D. degree in computer science from Xidian
University, Xi’an, Shaanxi, China, in 2018. He is currently an Associate
Professor in the School of Cyber Engineering, Xidian University. His
current research interests include wireless networks, distributed
systems, and intelligent terminals, with a focus on security and privacy
issues.
Ning Xi received the Ph.D. degree in computer science from Xidian
University, Xi’an, Shaanxi, China, in 2014. He is currently a Professor
with the School of Cyber Engineering, Xidian University. His research
interests include home network, service computing, and network security.
Jianfeng Ma received the Ph.D. degree in computer software and
telecommunication engineering from Xidian University, Xi’an, China, in
1995. He is currently a Professor and a Ph.D. Supervisor with the
Department of Computer Science and Technology, Xidian University. He is
also the Director of the Shaanxi Key Laboratory of Network and System
Security. His current research interests include information and network
security, wireless and mobile computing systems, and computer networks.