Computers & Security 140 (2024) 103788

Contents lists available at ScienceDirect

Computers & Security


journal homepage: www.elsevier.com/locate/cose

DawnGNN: Documentation augmented windows malware detection using graph neural network

Pengbin Feng a,∗ , Le Gai a , Li Yang b , Qin Wang c , Teng Li a , Ning Xi a , Jianfeng Ma a
a School of Cyber Engineering, Xidian University, Xi’an, Shaanxi, China
b School of Computer Science & Technology, Xidian University, Xi’an, Shaanxi, China
c University of New South Wales, Sydney, New South Wales, Australia

A R T I C L E  I N F O

Keywords:
Windows malware detection
Graph neural network
BERT-based embedding
Dynamic API call

A B S T R A C T

Application Program Interface (API) calls are widely used in dynamic Windows malware analysis to characterize the run-time behavior of malware. Researchers have proposed various approaches to mine semantic information from API calls to improve the performance of malware analysis. However, with increasingly sophisticated malware, the exploration of new semantic dimensions for API calls is never-ending. In this paper, we find that the official Windows API documentation is an unexplored information source in malware detection. Therefore, we propose a novel documentation-augmented Windows malware detection framework DawnGNN using the pre-trained semantic enhanced mechanism and graph neural network. First, it converts the API sequences into API graphs for further contextual information extraction. Next, we crawl API documentation from the official website and employ the pre-trained Bidirectional Encoder Representations from Transformers (BERT) model to encode functionality descriptions as API embeddings. Finally, it feeds the API graphs with API node attributes into the Graph Attention Network (GAT) classifier to perform Windows malware detection. Moreover, we verify the effectiveness of DawnGNN on three public datasets. Experimental results demonstrate the effectiveness of DawnGNN. Semantic information from the official API documentation is promising in the Windows malware detection domain.

1. Introduction

Over the past decades, malware has been expanding rapidly in personal computers and networks. According to a recent report [1], a total of 172,146 never-before-seen malware variants were identified in the first six months of 2023 by SonicWall, more than in any other year and an average of 956 per day. Malware can steal private data, perform unauthorized access, and cause system corruption, posing a serious threat to users. Therefore, it is necessary to devise an effective automatic detection method for preventing the spread of malware, especially the newly emerging variants.

Malware detection approaches can be mainly divided into static and dynamic analysis. Static analysis methods directly extract specific features [2–4], such as header information, opcode sequences, and static Application Program Interface (API) calls from executable files, but packers, code obfuscation, and metamorphism techniques can make static analysis less effective [5]. On the contrary, dynamic analysis extracts behavior information (including network traffic, registry operations, system calls, etc.) while running programs in an isolated environment [6]. Compared with static analysis, the observation of executed behavior makes dynamic analysis effective against various code obfuscation techniques [7].

The Windows API calls are widely used in dynamic malware detection [8–10]. A Windows program usually calls many system APIs during runtime, which characterizes all program behaviors including file operation, network access, registry modification, etc. These APIs constitute API sequences that usually contain distinguishable contextual patterns for malware detection [7]. Thus, researchers have proposed many machine learning or deep learning-based approaches that capture the meaningful relationship information among API calls to perform malware detection [11–13]. Unfortunately, most of these studies consider only the API name or the frequency of API usage and ignore semantic information about the API calls, so they cannot fully express the meaning of the API call sequences.

Ce et al. [14] point out that the feature mining of API sequence is not sufficient, which would cause some malware to evade detection.

* Corresponding author.
E-mail addresses: pbfeng@xidian.edu.cn (P. Feng), xdutanzhe@gmail.com (L. Gai), yangli@xidian.edu.cn (L. Yang), qinwangtech@gmail.com (Q. Wang),
tengli@xidian.edu.cn (T. Li), nxi@xidian.edu.cn (N. Xi), jfma@mail.xidian.edu.cn (J. Ma).

https://doi.org/10.1016/j.cose.2024.103788
Received 11 September 2023; Received in revised form 24 January 2024; Accepted 23 February 2024
Available online 29 February 2024
0167-4048/© 2024 Elsevier Ltd. All rights reserved.

Therefore, researchers have proposed to capture various information, such as API semantic categories [14], API parameters [15], process graphs [16], etc., for improving malware detection performance. Ce et al. [14] propose to extract semantic information including category, action, operation object, etc., from the API calls and construct semantic chains from API sequences to improve malware detection performance. DMalNet [17] proposes to combine semantic information from different types of parameters and graph learning to improve the performance of malware analysis. CruParamer [15] conducts fine-grained analysis on API parameters and employs rule-based and clustering-based parameter classification to construct parameter-augmented API sequences for further mining semantic information from parameters. MalPro [16] proposes a logistic regression-based parameter weighting mechanism to improve the semantics of API parameters and constructs process graphs from behavior logs to enhance malware detection. Mal-Bert-GCN [18] similarly leverages the BERT model to encode API sequences as node embeddings for directed process graphs.

When considering increasingly sophisticated malware, the exploration of new semantic dimensions and detection frameworks is never-ending.

Insufficient Feature Mining. Beyond the API name and the frequency of API usage, researchers have proposed to leverage additional information such as semantics within the API name, pre-defined semantic categories, parameter value-induced sensitivity level, cross-process interaction relationship, etc., for enhancing the current malware detection method. However, expert-defined sensitive semantics need regular updates, which is labor-intensive. Multi-stage feature processing on existing detection features increases the burden at the inference stage. Meanwhile, the exploration of new semantic features continues to be necessary for the arms race between attack and defense. Inspired by the software development process, we find that the official API documentation is a new semantic dimension that could better characterize API semantics and supplement existing detection features.

New Detection Framework. Graph neural network (GNN) has been proven to be effective in capturing critical information from program representation graphs [19] within cybersecurity tasks. In addition, applying GNN algorithms directly on graph structures is superior to sequence-based and tree-based approaches in the vulnerability detection domain [20]. The pre-trained Bidirectional Encoder Representations from Transformers (BERT) [21] model is widely used to encode semantic information from natural languages, which could automatically infer critical information from API documentation. The combination of intrinsic message-passing mechanisms within GNN algorithms and BERT-derived critical information could help to identify potentially risky behaviors.

In this paper, we focus on API-based dynamic malware analysis and try to explore additional semantic information from API sequences to fight against increasingly sophisticated Windows malware. Inspired by the success of Android API documentation in the identification of source and sink methods [22], we found that the official Windows API documentation is an unexplored information source in malware detection. With the proven effectiveness in various Natural Language Processing (NLP) tasks, the pre-trained BERT model could be leveraged to capture semantic information from the natural language described in API documentation. Consequently, we propose a novel Windows malware detection framework, documentation augmented Windows malware detection using graph neural network, named DawnGNN. First, it converts the API call sequences to API graphs for further extracting the contextual information. Second, we designed a semi-automatic method to crawl API documentation from the Microsoft official website. Next, the collected API documentation is inputted into the pre-trained BERT model for encoding newly discovered semantic information as API node embeddings. Finally, the Graph Attention Network (GAT) classifier takes the API graphs with node semantic attributes as input to perform Windows malware detection. On three public Windows malware benchmarks, we verify the effectiveness of our BERT-based encoding mechanism in improving the performance of malware detection. Experimental results also show that our proposed framework outperforms the existing detection methods only using raw API call sequences. In addition, we verify that the official API documentation is an effective information source in Windows malware detection, which could be complementary to current malware analysis methods. Our combination framework of graph learning and BERT-based encoding mechanism is promising in Windows malware detection.

In summary, we make the following contributions:

• We find that the official API documentation is an unexplored and effective information source for Windows malware detection and design a documentation-augmented approach.
• We propose a novel dynamic Windows malware detection framework, namely DawnGNN, that utilizes the BERT-based semantic enhanced mechanism and graph neural network to perform malware detection.
• We adopt multiple API embedding techniques and GNN algorithms to verify the effectiveness and explore the best performance of the proposed detection framework.
• We evaluate our approach on three public Windows malware benchmarks. Experimental results verify the effectiveness of our BERT-based semantic enhanced mechanism and design framework.

The rest of the paper is organized as follows. Section 2 presents the background about Windows malware and API. Section 3 describes the system design of DawnGNN in detail. Section 4 discusses the experimental results. Section 5 presents the limitation of DawnGNN. Section 6 summarizes the related works. The conclusion of this work is provided in Section 7.

2. Background

In this section, we discuss the threats of Windows malware and introduce the API with official documentation.

2.1. The threats of windows malware

According to the StatCounter website statistical data, Windows, developed by Microsoft, stands as the most widely used and widely distributed desktop operating system [23]. The market share distribution for the desktop operating systems is shown in Fig. 1. Consequently, the popularity and widespread usage of the Windows operating system (OS) make it an attractive target for cybercriminals. Malware targeting the Windows platform has increased enormously in recent years. According to AV-TEST statistics [24], as of September 2023, the count of Windows malware samples has reached 1.07 billion.

Fig. 1. Desktop operating system market share worldwide from StatCounter.


Windows malware mainly includes the following categories:

• Virus. A computer virus usually hides within another seemingly harmless program and generates copies and inserts them into other programs.
• Worm. A worm usually performs as stand-alone malware and actively propagates itself via networks to infect other files or computers.
• Rootkits. Rootkits can remain hidden by altering the system settings of the targeted OS and making the harmful processes invisible to normal users.
• Backdoor. A backdoor allows attackers to gain unauthorized remote access to a victim’s computer to bypass its protection mechanisms.
• Trojan horse. A Trojan horse usually camouflages as a regular, benign program or utility to mislead victims and activates hidden destructive functions when the application starts.
• Ransomware. Ransomware mainly focuses on demanding a ransom from the victims by encrypting key files or locking the whole system to prevent access.

2.2. Windows API with official documentation

The Windows API is an important part of the Windows OS and plays a key role in connecting Windows-based programs with Windows kernel and hardware [25]. The collection of all API functions is known as Win32 API. Apart from some console programs, all Windows programs can interact with Windows API and access predefined tasks such as opening and closing a file, displaying a prompt dialog box, storing computation results to files, and accelerating task processing via starting multiple threads. The system resources like file systems, processes, threads, network communication, and devices, are unified and managed by the OS kernel, and programs need to employ Windows API to accomplish their tasks. All available API functions are provided via dynamic link libraries, i.e., in .dll files, and commonly used libraries include Kernel32.dll, User32.dll, and GDI32.dll. The extraction and analysis of API calls are useful in determining the behavior and functions of a program.

We find that the official API documentation contains the semantic description for API functionality and carries more information than the API name, which can be used to enhance current API call-based Windows malware detection methods. The API Process32NextW is one representative potentially malicious API call [26]. Its partial official documentation is shown in Fig. 2. From Fig. 2, the sentence “Retrieves information...” in the red box can accurately describe the API functionality, which can also be used for semantic representation in Windows malware detection. In this paper, we design a documentation-augmented Windows malware detection framework to verify the effectiveness of API documentation. In addition, we leave the inclusion of semantic information for other auxiliary descriptions within the documentation page like parameters, and return values in malware detection as future work.

Fig. 2. Official API documentation for Windows API Process32NextW. Text in the red box denotes the API functionality description.

3. System design

The goal of DawnGNN is to leverage the official API documentation information for enhancing dynamic Windows malware detection using graph neural network and the BERT model. Fig. 3 shows the overall architecture of DawnGNN, which consists of three components, namely, API Graph Constructor, API2Vec Embedding Layer and GNN Classifier. Firstly, the API graph constructor leverages Cuckoo Sandbox [27] to perform automatic dynamic analysis on Windows Portable Executable (PE) programs to extract API call sequences. API calls, indicating the interactions between programs and system resource usage, are widely used to make unified behavior representations for malware detection. Then, it adopts structural dependencies within API sequences to build API graphs. Next, the API2Vec embedding layer generates API attributes by encoding official API documentation via the BERT-based language model. Finally, GNN is adopted to learn contextual information from attributed API graphs for performing effective malware detection.

3.1. API graph constructor

After obtaining the run-time API call sequences from Windows PE programs, this component converts the call sequences into graphs to capture the structural dependencies between APIs. Formally, given a set of API sequences, each program is represented as a graph $G = \{V, D\}$, where $V$ is a set of nodes and $v \in V$ denotes a unique API call, and $D \subseteq V \times V$ represents a set of directed edges, where an edge $\overrightarrow{v_i v_j}$ indicates a sequential connection between API calls. The attribute matrix is defined as $X = \{\cdots; x_i; \cdots\}$, where $x_i$ is the attribute of the $i$-th node. The adjacency matrix of graph $G$ is represented as $A \in \mathbb{Z}^{N_G \times N_G}$, where $N_G$ is the count of all API nodes. Thus, the API call graph maintains all API calls with their sequence information.

To construct the graph, we first collect API calls from all Windows programs as an API sequence set. Then, we construct the API graph by treating the sequence orders as call relationships, which could be improved by precise API parameter matching. To further illustrate the process, Fig. 4 shows a segment of the collected API sequence from a sample of Trojan malware [28], which is often disguised as legitimate software and performs unauthorized behaviors. The API sequence contains six API calls: ‘ntcreatefile’, ‘ldrgetprocedureaddress’, ‘setfileattributesw’, ‘getfileattributesw’, ‘mesageboxtimeouta’ and ‘ntterminateprocess’. In the API sequence, the API call ‘ntcreatefile’ is followed by ‘mesageboxtimeouta’, which means a directed edge is created from the ‘ntcreatefile’ node to the ‘mesageboxtimeouta’ node.

3.2. API2Vec embedding layer

In this paper, we generate API embeddings via learning semantic information from crawled official Windows API documentation. In addition, attributes like the parameters of the API function, the location of the API within its sequence, and code semantic characteristics within the API implementation, can be easily added to the node attribute to further improve malware classification performance. The pre-trained BERT model [21] has been proven to be effective in various NLP tasks, which is suitable for processing API documentation. Firstly, we design a semi-automatic method to crawl API documentation from the official website (https://learn.microsoft.com/en-us/windows/win32/api) via analyzing the website page layout structure. These documents briefly summarize the functionality of every API in natural language, which represents an effective information source for semantic information extraction. Then, the BERT model has the ability to generate API embeddings directly from these language descriptions.

After manually analyzing the official API website, we found that most pages share similar layout structures. Thus, we developed XML Path Language (XPath) parser scripts to extract all API names with corresponding description documentation. To counter the complex anti-crawl mechanism of the Microsoft website, we manually save the website page covering all API categories. Next, we parse the category


website to collect every API category Uniform Resource Locator (URL). Then, the API category URL is iteratively parsed to collect header file URLs for API documentation extraction. In this way, we obtain most of the API documentation, excluding sparsely distributed cases such as Windows driver APIs or out-of-date APIs. For the remaining APIs, we manually search online to collect the description. Finally, we crawled 32,763 Windows APIs with corresponding description documentation. We plot the distribution of the number of words for Windows API documentation in Fig. 5, which exhibits great diversity and contains sufficient information. We also plot the word cloud of Windows API documentation in Fig. 6, which covers the main functionality provided by Windows OS.

Fig. 3. The architecture of DawnGNN system.

Fig. 4. The workflow of API graph constructor.

Fig. 5. Total number of words for Windows API documentation.

Fig. 6. Word cloud of Windows API documentation.

After building the API documentation corpus, we leverage the pre-trained BERT model to capture the features and encode the semantic representation of each Windows API. In order to learn the context relationships between different words within the API documentation, we perform the masked language model (MLM) task. We present the detailed process of the MLM task for API documentation in Fig. 7. The [CLS] and [SEP] tags are added to the API documentation, which represent the beginning of a sentence and the sentence separator, respectively. In one API description, eighty percent of the chosen words are masked by [MASK] (mask-out tokens), ten percent are kept unchanged, and the remaining ten percent are replaced with other words (corrupted tokens). Then, the multi-layer bidirectional transformer encoder within BERT processes the input, predicts the masked-out tokens, and outputs a probability for a particular token $t = \text{[MASK]}$ via a fully connected layer following the last transformer encoder. The cross-entropy loss function is formalized as:

$$L_{bert}(\theta_b) = -\sum_{i=1}^{M} \log p(m = m_i \mid \theta_b), \quad m_i \in [1, 2, \ldots, |M|], \tag{1}$$

where $\theta_b$ represents the parameters of the transformer encoder and output layer within BERT, and $M$ denotes the collection of masked tokens during the training phase. In each self-attention layer within BERT, an input token updates its embedding by computing the attention weights with other connected tokens’ embeddings. In this way, each token’s embedding captures context-sensitive semantic information and changes with its location and context. Thus, the BERT model can learn the semantic information of API documentation. When pre-training is completed, we input the API’s corresponding official functionality description into the BERT model and treat the hidden state of the last layer as the semantic embedding of that API.

Specifically, the original API documentation contains additional words such as the API name “The NotifyAddrChange function...”, abbreviations “(ARP)”, annotations “(Unicode)”, etc., which are meaningless to the functional description and are not represented in natural language. We therefore remove these additional words to ensure that the BERT model could accurately capture the API semantic. Alternatively, we


leverage the one-hot encoding and Word2vec static embedding mechanism to generate API node attributes directly from the collected API sequences. Next, we compare the BERT encoding mechanism with the above two mechanisms to highlight the importance of semantic information.

Fig. 7. Pre-trained BERT with MLM task for API documentation.

Fig. 8. API Graph Structure Modeling with GAT.

3.3. GNN classifier

After the processing of the API2Vec Embedding Layer component, we obtain a number of dynamic API graphs with corresponding node attributes. Then, the GAT [29] classifier is trained on these API graphs to extract the structural information and further perform Windows malware detection. GAT is a graph neural network based on an attention-based message-passing mechanism. This attention mechanism allows GAT to adaptively allocate attention weights to neighbor nodes. Next, it leverages the weighted sum of neighbor nodes to update the representation of the current node. In addition, GAT has the advantage of strong generalization for directed graphs.

The network structure of the GAT is shown in Fig. 8. As shown in Fig. 8, at first, each API’s official documentation within the API graph is fed into BERT to extract semantic embedding. Then, the API node embeddings and API graph structure are used as the input of GAT to compute graph embedding with structure and semantic information. During the iterative process of every layer within GAT, the semantic embedding of an API node is passed to its neighbor nodes. With the help of the multi-head attention mechanism, each API node can focus


on more critical neighbor nodes. Given two connected API nodes $i$ and $j$, the attention weight $\alpha_{i,j}^{t}$ at attention head $t$ in the $l$-th layer is calculated according to the following formula:

$$\alpha_{i,j}^{t} = \frac{\exp\left(\mathrm{ReLU}\left(F_a\left[W_t h_i^{l} \,\|\, W_t h_j^{l}\right]\right)\right)}{\sum_{c \in R_i} \exp\left(\mathrm{ReLU}\left(F_a\left[W_t h_i^{l} \,\|\, W_t h_c^{l}\right]\right)\right)}, \tag{2}$$

where $h^{l}$ denotes the hidden representation of an API node at the $l$-th layer, $h^{0}$ equals the semantic embedding of an API node generated by the BERT model, $W_t$ represents the learnable parameters of the $t$-th attention head, $F_a$ is a feedforward neural network, ReLU denotes the rectifier activation function, $\|$ denotes the concatenation operation, and $R_i$ represents the neighbor nodes of API node $i$. Then, the update process of each API node’s embedding based on the attention mechanism is formalized as follows:

$$h_i^{l+1} = \Big\Vert_{t=1}^{T} \sigma\Big(\sum_{j \in R_i} \alpha_{i,j}^{t} W_t h_j^{l}\Big), \tag{3}$$

where $T$ denotes the number of attention heads and $t$ represents the $t$-th attention head. Finally, GAT updates the API node embeddings of the API graph and sums them into the graph semantic embedding $s_G$ as follows:

$$s_G = \sum_{i=0}^{N_G} h_i^{L}. \tag{4}$$

The final prediction classification is performed via a Multilayer Perceptron (MLP) model, which can be represented as:

$$Y = \mathrm{MLP}(s_G \mid \theta_{mlp}), \tag{5}$$

where $\theta_{mlp}$ denotes the learnable parameters of the MLP model, and $Y$ represents the final classification label, malware or benign.

In this paper, we leverage the GNN model to generate graph embedding [30] via encoding all node hidden representations and graph structure information into low-dimensional space. The node hidden representation is transformed from aggregating local neighbor node information. DawnGNN also adopts Graph Convolutional Network (GCN) and Graph Isomorphism Network (GIN), and compares their performance to identify the most effective mechanism in malware detection. GCN is another representative GNN, where the node hidden representation is calculated via the following formula:

$$H^{l+1} = \mathrm{ReLU}(\hat{A} H^{l} W^{l}) \tag{6}$$

where $H^{l}$ indicates the hidden representation matrix at the $l$-th layer for all nodes, and $H^{0}$ denotes all API node embeddings generated by the API2Vec Embedding Layer component. $W^{l}$ is the learnable weight parameters of the $l$-th GCN layer. $\hat{A} = \tilde{M}^{-\frac{1}{2}} \tilde{A} \tilde{M}^{-\frac{1}{2}}$, where $\tilde{M}$ denotes the degree matrix, and $\tilde{A} = A + I_s$. $I_s$ is the identity matrix. GIN adopts an MLP model to aggregate comprehensive information, which is formalized as:

$$h_i^{l+1} = \mathrm{MLP}^{l+1}\Big((1 + \epsilon^{l}) h_i^{l} + \sum_{j \in R(i)} h_j^{l}\Big), \tag{7}$$

where $\epsilon^{l}$ represents scalar learnable parameters.

4. Experiments and evaluation

In this section, we comprehensively evaluate our proposed system DawnGNN via various experiments. In the following, we first describe the experiment settings and dataset used in DawnGNN. And then, we discuss the results of our experiments.

4.1. Experimental setup and dataset

The proposed framework DawnGNN was implemented and tested on a computer running Ubuntu 20.04 (64-bit) with an Intel(R) Core(TM) i7-12700 CPU @ 2.10 GHz, 16.0 GB RAM, an NVIDIA RTX 3060 GPU, and a 512 GB hard disk drive. DawnGNN was implemented in Python 3.8.10 with PyTorch 2.0.0 and the Transformers 4.28.1 framework; other libraries such as Scikit-learn, Numpy, Pandas, and Requests were also used. The framework takes sequences of API calls extracted from Windows exe files as input.

We collected three existing datasets of malicious and normal API calls for our experimental evaluation. The information of these datasets is summarized in Table I. As described in Section 3.1, we generate API graphs based on collected run-time API call sequences. The statistics of generated API graphs on three datasets are shown in Table II. From Table II, we can observe that the API graphs are sparse, which is suitable for graph classification tasks. Using different datasets allows us to evaluate the malware detection performance of DawnGNN from multiple dimensions. Specifically, we randomly shuffle the dataset and split 80% for the training, 10% for validation, and the rest 10% for testing.

Table I
Three datasets of API calls for experimental evaluation.

Dataset             # of Malicious Samples   # of Benign Samples   Released Date
MalBehavD-V1 [26]   1,285                    1,285                 2022
PE_APICALLS [31]    452                      101                   2019
APIMDS [32]         23,080                   300                   2015

Table II
Statistics of generated API graphs on three datasets.

Dataset        Label     Avg. # of Nodes   Avg. # of Edges
MalBehavD-V1   malware   41.54             34.61
MalBehavD-V1   benign    42.61             31.22
PE_APICALLS    malware   37.87             29.19
PE_APICALLS    benign    19.64             28.10
APIMDS         malware   108.37            30.27
APIMDS         benign    42.68             31.68

4.2. Measure metrics

We evaluate the Windows malware detection performance of DawnGNN with the following five metrics: precision, recall, true negative rate, accuracy, and F1-score. These metrics are computed via true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). In the Windows malware detection scene, TP denotes the count of correctly identified malicious exe files, and TN denotes the count of correctly detected benign exe files. FP indicates the count of benign exe files misidentified as malicious, and FN denotes the count of missed malicious exe files. The above measure metrics are calculated as follows:

$$\text{Precision} = \frac{TP}{TP + FP} \tag{8}$$

$$\text{Recall} = \frac{TP}{TP + FN} \tag{9}$$

$$\text{True negative rate (TNR)} = \frac{TN}{TN + FP} \tag{10}$$

$$\text{Accuracy (Acc)} = \frac{TP + TN}{TP + TN + FP + FN} \tag{11}$$

$$\text{F1-score (F1)} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{12}$$

4.3. Performance of malware detection

In this paper, DawnGNN performs Windows malware detection via graph neural network and BERT-based semantic enhanced mechanism. Therefore, in this experiment, we comprehensively evaluate the effectiveness of the two main components.
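The five metrics of Section 4.2 (Eqs. (8) to (12)) follow directly from the four confusion-matrix counts. The sketch below is illustrative only: the function name `detection_metrics` and the example counts are assumptions made for this example and are not taken from the paper's experiments. Returning 0.0 for an empty denominator is also a convention chosen here, not something the paper specifies.

```python
def detection_metrics(tp, tn, fp, fn):
    """Compute Eqs. (8)-(12) from confusion-matrix counts.

    Malware is treated as the positive class. A ratio with an empty
    denominator is reported as 0.0 (an assumption of this sketch).
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0        # Eq. (8)
    recall = tp / (tp + fn) if (tp + fn) else 0.0           # Eq. (9)
    tnr = tn / (tn + fp) if (tn + fp) else 0.0              # Eq. (10)
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0          # Eq. (11)
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0   # Eq. (12)
    return {
        "precision": precision,
        "recall": recall,
        "tnr": tnr,
        "accuracy": accuracy,
        "f1": f1,
    }


# Hypothetical counts, chosen only to exercise the formulas.
m = detection_metrics(tp=90, tn=80, fp=20, fn=10)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

On a heavily imbalanced test set with few benign samples, the TNR denominator TN + FP is small, so a handful of misclassified benign files moves TNR far more than it moves precision or recall.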

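The graph component evaluated here operates on the API graphs of Section 3.1, where each unique API call becomes a node and each pair of consecutive calls becomes a directed edge. The following is a minimal sketch of that conversion, not the authors' implementation. The function name `build_api_graph`, the example sequence, the binary (unweighted) adjacency matrix, and the handling of repeated transitions are all assumptions of this example.

```python
def build_api_graph(sequence):
    """Convert one API call sequence into (nodes, edges, adjacency).

    nodes: unique API names in order of first appearance.
    edges: sorted list of (src_index, dst_index) pairs, one per
           distinct consecutive pair in the sequence.
    adjacency: N x N list of lists with 1 where an edge exists
               (binary weights are an assumption of this sketch).
    """
    index = {}
    nodes = []
    for api in sequence:
        if api not in index:
            index[api] = len(nodes)
            nodes.append(api)

    # A directed edge v_i -> v_j is added whenever v_j directly follows v_i.
    edges = {(index[a], index[b]) for a, b in zip(sequence, sequence[1:])}

    n = len(nodes)
    adjacency = [[0] * n for _ in range(n)]
    for i, j in edges:
        adjacency[i][j] = 1
    return nodes, sorted(edges), adjacency


# Hypothetical sequence that reuses API names mentioned in Section 3.1.
seq = [
    "ntcreatefile",
    "ldrgetprocedureaddress",
    "setfileattributesw",
    "ntcreatefile",
    "ntterminateprocess",
]
nodes, edges, adjacency = build_api_graph(seq)
print(nodes)
print(edges)
```

Because nodes are unique API names, a long sequence that repeats the same calls collapses into a small graph, which is consistent with the modest average node and edge counts reported in Table II.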

Table III
Detection performance comparison with different detection methods in dataset MalBehavD-V1.

Detection method    Precision  Recall  TNR     Acc     F1
one-hot + RF        0.9009     0.9494  0.8908  0.9203  0.9231
one-hot + LSTM      0.9150     0.9469  0.9106  0.9283  0.9295
one-hot + GAT       0.9195     0.9586  0.9074  0.9339  0.9368
Word2Vec + LSTM     0.9323     0.9661  0.9277  0.9441  0.9475
Word2Vec + GAT      0.9559     0.9532  0.9436  0.9432  0.9543
BERTsmall + LSTM    0.9609     0.9584  0.9512  0.9508  0.9595
BERTsmall + GAT     0.9667     0.9756  0.9527  0.9607  0.9683
BERTbase + LSTM     0.9632     0.9615  0.9553  0.9535  0.9618
BERTbase + GAT      0.9697     0.9788  0.9556  0.9638  0.9711
BERTsmall and BERTbase denote the small and base versions of the pre-trained BERT encoding mechanism.

First, we compare the BERT-based encoding mechanism with the one-hot and Word2Vec-based encoding methods to highlight the effectiveness of the semantic information extracted from API documentation. In the one-hot encoding method, each API call is transformed into a binary vector in which each position represents a unique API call, so the dimension of the vector equals the number of collected APIs. In the Word2Vec-based encoding method, we treat each API call as a word and use a neural network model to learn word associations from a large set of API sequences.

Next, we compare the graph feature-based detection method with the statistical and sequence feature-based detection methods to illustrate the effectiveness of graph feature learning. We combine the one-hot vector representation of API call sequences with the Random Forest (RF) model as the statistical feature-based detection method. We use the long short-term memory (LSTM) model to handle API sequences as the typical sequence feature-based detection method. The LSTM model requires an API encoding mechanism before API sequences are input, which can be any of the three encoding mechanisms.

Specifically, we compare the detection methods mentioned above on the balanced dataset MalBehavD-V1. We use the default settings for the Word2Vec and RF algorithms. For the BERT-based encoding mechanism, we select the base and small versions according to our experiment environment: we perform the masked language modeling (MLM) task on the collected official functionality descriptions with bert_base_uncased and bert_small as the initial pre-trained models and extract the last hidden layer as the API embedding. The dimensions of the generated API embeddings are 768 and 512, respectively. As there are multiple versions of the BERT model and many variant models, such as RoBERTa [33] and SENTENCE-BERT [34], we leave the exploration of the optimal encoding mechanism as future work. For LSTM, we follow the parameter settings of an existing detection method [35].

The Windows malware detection performance with different encoding mechanisms and feature structures is shown in Table III. From Table III, we can observe that our BERT-based semantic enhanced mechanism improves the malware detection performance under both the LSTM and GAT models. This illustrates that API documentation contains rich semantic information for identifying Windows malware. The base version of BERT outperforms the small version, which illustrates that larger API embeddings carry more precise semantic information in our API documentation encoding case. In addition, the pre-trained BERT model is able to extract context-sensitive information from API documentation. From Table III, we can also observe that the graph feature learning method achieves the highest precision and F1 score under each of the three encoding mechanisms, although with Word2Vec the LSTM model is marginally ahead in recall and accuracy. This illustrates that the graph feature-based method considers structure information, improving the detection performance compared to methods that only consider sequence or statistical information.

We also evaluate the effectiveness of DawnGNN with different encoding mechanisms on the datasets PE_APICALLS and APIMDS. The Windows malware detection performance on the two datasets is shown in Table IV. From Table IV, we can observe that our BERT-based encoding mechanism improves the malware detection performance on the two imbalanced datasets via the semantic information extracted from API documentation. On both datasets, the TNR is lower than Precision and Recall. This is because the malware samples far outnumber the benign samples, and the model cannot characterize benign patterns without enough samples.

4.4. Comparison of GNN algorithms

In this section, we explore the influence of multiple GNN algorithms on Windows malware detection performance.

We tune the hyper-parameters that significantly affect the detection performance according to domain knowledge to obtain the optimal detection performance. The experiment is conducted on the dataset MalBehavD-V1 to eliminate the interference of imbalanced samples. The search range of hyper-parameters and the optimal values for the three GNN models are shown in Table V. From Table V, we can observe that the three GNN algorithms achieve the best performance at 100 epochs with a batch size of 128 and a learning rate of 0.0001. All three models perform best with multiple layers (two hidden layers in Table V). The optimal hidden dimensions are 32 for GCN, 16 for GIN, and 12 for GAT.

We compare the detection performance of DawnGNN using three representative GNN algorithms: GCN, GIN, and GAT. The detection performance comparison of the three GNN algorithms is shown in Table VI. From Table VI, we can observe that GAT provides the best detection performance. The experiment results show that DawnGNN with GIN exhibits superior detection performance over GCN, which illustrates that adopting a powerful MLP-based message aggregation function leads to the improvement of GIN over GCN. The GAT algorithm employs an attention mechanism to adaptively aggregate important information from neighbor nodes. In addition, the ability to handle directed graphs makes GAT more suitable for API graph scenarios. Therefore, DawnGNN with GAT achieves the best performance when compared with GIN and GCN. We also plot how the detection performance varies with epochs in Fig. 9 to observe fluctuations in detection performance. From Fig. 9, we can observe that our proposed Windows malware detection framework achieves good performance with all three GNN algorithms. This illustrates that the combination of GNN models and BERT-based enhancing mechanisms is promising in Windows malware detection.

4.5. Comparison with other approaches

In the following, we compare DawnGNN with existing Windows malware detection approaches on public datasets to verify the effectiveness of our proposed detection framework.

We examine the performance of the DawnGNN framework against other existing detection approaches based on API call sequences extracted from exe files; the comparative results are presented in Table VII. First, we compare DawnGNN against MalDy [36] and MalDetConv [26] on the datasets MalBehavD-V1 and PE_APICALLS. MalDy leverages n-grams, feature hashing, and Term Frequency–Inverse Document Frequency (TF-IDF) to vectorize the behavior reports. Then, an ensemble prediction framework is constructed to perform precise malware detection. MalDetConv is an automated behavior-based detection framework that constructs a hybrid of Convolutional Neural Network (CNN) and Bidirectional Gated Recurrent Unit (BiGRU) models to learn high-dimensional representations of API call sequences and then leverages a fully connected neural network module for malware detection. On dataset MalBehavD-V1, DawnGNN achieved a detection accuracy of 0.9638, an improvement of 0.79% and 0.28% in detection accuracy over MalDy and MalDetConv,


Table IV
Detection performance comparison with different encoding mechanisms in datasets PE_APICALLS and APIMDS.

Dataset       Encoding   Precision  Recall  TNR     Acc     F1
PE_APICALLS   one-hot    0.9093     0.9808  0.7246  0.9070  0.9283
PE_APICALLS   Word2Vec   0.9214     1.0000  0.7917  0.9286  0.9423
PE_APICALLS   BERT       0.9722     1.0000  0.8929  0.9762  0.9855
APIMDS        one-hot    0.9475     1.0000  0.8013  0.9524  0.9714
APIMDS        Word2Vec   0.9861     1.0000  0.8413  0.9878  0.9928
APIMDS        BERT       0.9969     1.0000  0.8714  0.9975  0.9984
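
The gap between TNR and the other metrics in Table IV follows from how each metric is defined over the confusion matrix. The sketch below is a minimal illustration using the standard definitions, with malware as the positive class; the `Confusion` and `metrics` names and the counts in the usage example are hypothetical and are not taken from the paper's datasets.

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    tp: int  # malware samples correctly flagged as malware
    fp: int  # benign samples wrongly flagged as malware
    tn: int  # benign samples correctly classified as benign
    fn: int  # malware samples missed by the detector


def metrics(c: Confusion) -> dict:
    """Compute the five metrics reported in the tables (standard definitions)."""
    precision = c.tp / (c.tp + c.fp)
    recall = c.tp / (c.tp + c.fn)
    tnr = c.tn / (c.tn + c.fp)  # true negative rate, i.e. recall on the benign class
    acc = (c.tp + c.tn) / (c.tp + c.fp + c.tn + c.fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"Precision": precision, "Recall": recall, "TNR": tnr, "Acc": acc, "F1": f1}


# Hypothetical imbalanced test split: 900 malware samples and 100 benign samples.
example = metrics(Confusion(tp=900, fp=20, tn=80, fn=0))
```

With only 100 benign samples in this hypothetical split, 20 false positives already pull TNR down to 0.80 while accuracy stays at 0.98, which mirrors the pattern the text attributes to the scarcity of benign samples.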

Table V
Search range of hyper-parameters and optimal values for three GNN models.

Hyper-parameters   GCN      GAT      GIN      Search range
Learning rate      0.0001   0.0001   0.0001   {0.01, 0.001, 0.0001, 0.00001}
Batch size         128      128      128      {32, 64, 128, 256, 512}
Epochs             100      100      100      {40, 60, 80, 100, 200, 300}
Hidden layers      2        2        2        {2, 3, 4, 5}
Hidden dim         32       12       16       {8, 16, 32, 64, 128} for GCN and GIN
Attention heads    -        12       -        {6, 8, 10, 12, 14, 16} for GAT
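
To give a sense of the size of the search space in Table V, the following sketch enumerates every combination of the listed ranges. It is only an illustration: the text states that the hyper-parameters were tuned according to domain knowledge, not by exhaustive search, and the names `COMMON`, `MODEL_SPECIFIC`, and `candidate_configs` are hypothetical. The sketch also assumes that the only model-specific range for GAT is the number of attention heads, since Table V gives a hidden-dimension range for GCN and GIN only.

```python
from itertools import product

# Ranges shared by all three models (copied from Table V).
COMMON = {
    "learning_rate": [0.01, 0.001, 0.0001, 0.00001],
    "batch_size": [32, 64, 128, 256, 512],
    "epochs": [40, 60, 80, 100, 200, 300],
    "hidden_layers": [2, 3, 4, 5],
}

# Model-specific ranges (assumption: GAT searches attention heads only).
MODEL_SPECIFIC = {
    "GCN": {"hidden_dim": [8, 16, 32, 64, 128]},
    "GIN": {"hidden_dim": [8, 16, 32, 64, 128]},
    "GAT": {"attention_heads": [6, 8, 10, 12, 14, 16]},
}


def candidate_configs(model: str):
    """Yield every hyper-parameter combination for the given model."""
    space = {**COMMON, **MODEL_SPECIFIC[model]}
    keys = list(space)
    for values in product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))


gcn_configs = list(candidate_configs("GCN"))
gat_configs = list(candidate_configs("GAT"))
```

Under these assumptions the full grid contains 2400 configurations for GCN or GIN and 2880 for GAT, which suggests why narrowing the search with domain knowledge is attractive.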

Fig. 9. Detection performance comparison of three GNN models with epochs.
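
The three GNN models compared in Fig. 9 all consume directed API graphs derived from API call sequences. A minimal sketch of one plausible construction is given below, assuming that each distinct API name becomes a node and each pair of consecutive calls adds a directed edge. The function name `build_api_graph`, the edge weighting by transition count, and the example trace are illustrative assumptions, not details confirmed by the paper.

```python
from collections import Counter


def build_api_graph(sequence):
    """Build a directed graph from an ordered list of API call names.

    Returns (nodes, edges): nodes maps each API name to an integer id in
    order of first appearance; edges maps (src_id, dst_id) to the number
    of times the transition src -> dst occurs in the sequence.
    """
    nodes = {}
    for api in sequence:
        nodes.setdefault(api, len(nodes))

    edges = Counter()
    for src, dst in zip(sequence, sequence[1:]):
        edges[(nodes[src], nodes[dst])] += 1
    return nodes, dict(edges)


# Hypothetical trace of Windows API calls.
trace = ["CreateFileW", "WriteFile", "WriteFile", "CloseHandle", "CreateFileW"]
nodes, edges = build_api_graph(trace)
```

In such a design, a per-API embedding (for example one derived from the documentation) would be attached to each node id as its feature vector before the graph is passed to a GNN.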

Table VI
The detection performance comparison with three GNN models.

Algorithms   Precision  Recall  TNR     Acc     F1
GCN          0.9401     0.9666  0.9388  0.9406  0.9524
GIN          0.9506     0.9701  0.9469  0.9509  0.9603
GAT          0.9697     0.9788  0.9556  0.9638  0.9711

respectively. On dataset PE_APICALLS, DawnGNN improves the detection accuracy by 2.44% and 1.89% over the MalDy and MalDetConv frameworks, respectively. This illustrates the effectiveness of combining our BERT-based semantic enhancement with graph learning in the Windows malware detection domain. On dataset APIMDS, most methods already achieve a detection accuracy of more than 99%. Although MalDetConv achieves the highest detection accuracy, DawnGNN also performs highly precise malware detection. In this paper, we design a documentation-augmented mechanism to verify the effectiveness of the semantic information extracted from the official API documentation. Experiment results from Table VII illustrate that our proposed DawnGNN reaches state-of-the-art malware detection performance.

Table VII
Detection performance comparison with existing approaches.

Dataset       Approaches            Behavior feature                        Feature vectorization method        ML/DL algorithm         Detection accuracy
MalBehavD-V1  MalDy [36]            Behavior reports from analysis sandbox  N-grams + feature hashing + TF-IDF  Ensemble learning       0.9559
MalBehavD-V1  MalDetConv [26]       API call sequences                      Word2vec                            Hybrid model CNN-BiGRU  0.9610
MalBehavD-V1  DawnGNN               API call sequences + API documentation  BERT-based semantic enhancement     GAT                     0.9638
PE_APICALLS   MalDy [36]            Behavior reports from analysis sandbox  N-grams + feature hashing + TF-IDF  Ensemble learning       0.9518
PE_APICALLS   MalDetConv [26]       API call sequences                      Word2vec                            Hybrid model CNN-BiGRU  0.9573
PE_APICALLS   DawnGNN               API call sequences + API documentation  BERT-based semantic enhancement     GAT                     0.9762
APIMDS        Amer and Zelinka [7]  API call sequences                      Word2Vec + clustering similarity    Markov chain model      0.9990
APIMDS        Ki et al. [32]        API call sequences                      DNA sequence alignment              Similarity matching     0.9980
APIMDS        Tran and Sato [37]    API call sequences                      TF-IDF                              SVM                     0.9619
APIMDS        MalDetConv [26]       API call sequences                      Word2vec                            Hybrid model CNN-BiGRU  0.9993
APIMDS        DawnGNN               API call sequences + API documentation  BERT-based semantic enhancement     GAT                     0.9975

5. Discussion & limitation

In this paper, we focus on dynamic API-based malware detection and leverage API documentation information to enhance Windows malware detection performance. Although DawnGNN is useful and effective for detecting Windows malware, there is still room for improvement in our current implementation when compared with the large number of existing detection approaches. DawnGNN directly constructs API graphs from API call sequences to capture structural dependencies. This coarse-grained graph construction mechanism can be improved by following DMalNet [17], which builds precise API call graphs using additional API parameter matching. DawnGNN identifies malware via unusual call contexts, which rarely occur among benign programs. For indistinguishable call contexts, additional run-time behavior information such as API parameters and network activities [38] can be adopted to improve detection performance. Malware could leverage concealment-based evasion methods [39], such as environment analysis, delayed execution, and conditional execution, and disguise itself as a benign program to evade dynamic analysis. This could be mitigated by X-Force [40], which leverages the forced execution technique to increase the coverage of dynamic analysis. In addition, APT malware leverages Living-Off-The-Land techniques [41] to conduct nefarious actions, which could be countered by provenance graph-based APT detection methods [42]. Our graph learning-based detection framework is inherently susceptible to adversarial attacks [43], which could be countered by more robust learning methods [44].

Besides API category and parameter information [14–16], the official API documentation is another unexplored information source. Therefore, we crawl API documentation from Microsoft’s official website and leverage the BERT model to extract semantic information, which could be complementary to existing Windows malware detection approaches. Mal-Bert-GCN [18] built directed process graphs from raw API sequences and employed GCN to perform malware detection. Similarly, our BERT-based semantic enhanced mechanism is orthogonal to inter-process interaction information. In this paper, we only extract the functional description of each API call to obtain semantic information. The numerous other descriptions, including parameters, return values, remarks, and requirements, are potentially useful information sources for malware detection, especially considering the boom in large language models [45]. We leave the design of a more comprehensive detection framework that digs deeper into the semantic information of API documentation as future work.

Our documentation-augmented malware detection framework is based on the classic BERT-based encoding mechanism and a standard GNN algorithm. The encoding mechanism could be improved by the large version or other optimized models, such as RoBERTa, ALBERT [46], and DistilBERT [47]. The standard GNN algorithm could be improved by jumping knowledge networks [48], self-supervised learning mechanisms [49], or other superior graph learning algorithms [50]. We leave the exploration of the optimal encoding mechanisms and graph learning models for highly precise malware detection as future work. We verify the effectiveness of the proposed semantic enhanced mechanism on three publicly available dynamic malware datasets. We plan to strengthen the experimental validation when more recent and diverse datasets become available. The semantic enhanced mechanism could complement current Windows malware detection methods. We leave the validation of this enhancement, using open-source or re-implemented versions of current detection methods, as future research. In addition, the combination of a graph neural network and a BERT-based semantic enhanced mechanism is also applicable to malware detection on other platforms, such as Android and Linux. Bringing in external software operation knowledge is a promising design in the malware detection domain.

6. Related work

This section provides a detailed discussion of relevant literature on dynamic Windows malware detection and BERT-based security detection methods.

6.1. Dynamic Windows malware detection

The dynamic method executes a program in a controlled environment, observes the execution status of the program, and then determines whether it has malicious behaviors by examining API calls, network traffic, and other critical characteristics. API calls can provide valuable run-time information for identifying malicious activities. Therefore, researchers have proposed many API call-based detection approaches.

Researchers have long focused on extracting more effective features from API call sequences to perform malware detection. Fang et al. [51] use a hash function to encode the API call names, return values, and module names to obtain more detailed behavior information. Agrawal et al. [52] perform one-hot encoding on API call sequences and n-gram encoding on API parameters. Unfortunately, these approaches only consider some of the parameters or treat all parameters as strings, which cannot fully explore the information from various parameters. Zhang et al. [10] employ different hashing strategies to encode API names and various parameters, which still cannot express semantic information. Rabadi and Teo [53] divide the API parameters into multiple representation sets before applying feature hashing, which may lead to a loss of distinction.

Many studies have applied ML and DL models to analyze API call sequences. Qiao et al. [54] leverage frequent itemset mining and similarity calculation to process the API names and parameters within API call sequences. Uppal et al. [8] select the important features via frequency statistics from API call sequences, and then employ the Support Vector Machine (SVM) classifier to perform malware detection. The above methods ignore the relationship between API calls and can be easily evaded by modifying the frequency counter value. Dahl et al. [55] employ n-grams to process the system API call sequences and then leverage logistic regression and shallow neural network classifiers to perform malware classification. Ndibanje et al. [56] construct feature vectors from API sequences and employ similarity-based statistical methods to detect malware. Zhang et al. [57] construct API relationship graphs to represent the internal relationships among various programs. Then, they leverage a knowledge graph embedding algorithm to input the API graphs into RF, Model Pool, SVM, and Deep Neural Network (DNN) to perform malware detection. Pascanu et al. [58] employ Recurrent Neural Networks (RNNs) to capture sequence relations between APIs and feed the outputs of the RNNs into a max-pooling layer for malware classification. Kolosnjaji et al. [9] leverage CNN to process consecutive API sequences and apply LSTM to handle time-series dependence. Agrawal et al. [52] propose to construct several stacked LSTMs to process API names and string parameters. Zhang et al. [10] design a hybrid deep learning framework including gate-CNNs and Bi-LSTM to process API


names and parameters for performing malware detection. Researchers have also tried to transform the API call sequences into graphs to capture the direct or indirect relationships between API calls. Jiang et al. [59] transform the API sequences of exe files into a call graph by matching the caller-callee relationships. Then, graph embedding techniques and stacked denoising autoencoders are combined to perform malware detection. Amer and Zelinka [7] leverage contextual similarity in API sequences to cluster APIs, and employ the Markov chain to capture relationships and perform malware detection. In this paper, we combine the GNN model and the BERT-based semantic enhancement mechanism to classify API graphs and perform effective Windows malware detection.

6.2. BERT-based security detection

BERT has emerged as a powerful natural language processing model that is capable of learning bidirectional contextual representations. Following its great success in various language tasks, the BERT model has been applied in security detection domains, such as malware detection [60], vulnerability detection [61], and malicious traffic detection [62]. MalBert [63] and Souani et al. [60] design BERT-based frameworks to perform Android malware detection and classification from elements extracted from the Android Manifest file. SmartConDetect [61] proposes to extract code fragments via a static analysis tool and then feed them into a pre-trained BERT model to perform vulnerability detection in smart contracts. BINSHOT [64] designs a BERT-based similarity learning architecture to perform effective binary code similarity detection. The architecture adopts a weighted distance vector with binary cross entropy as the loss function. ET-BERT [62] proposes a new BERT-based encrypted traffic representation model, which can capture deep contextualized datagram-level representations from large-scale unlabeled data and perform effective malicious traffic detection. Enimanal [65] proposes a specialized BERT model, which leverages the declarations within the Linux manual to extract semantic information for each system call, and then utilizes GNNs to perform cross-architecture IoT malware analysis. CoDOC [22] proposes a fusion framework to accurately identify sensitive Android source and sink methods. This framework leverages graph learning to encode source code information and a BERT-based model to extract semantic information from Android documentation. Similarly, in this paper, DawnGNN crawls the official Windows API documentation and leverages a BERT-based semantic extraction mechanism to enhance malware detection.

7. Conclusion

In this paper, we propose a novel dynamic Windows malware detection system using graph neural networks and a BERT-based semantic enhancement mechanism, called DawnGNN. It constructs API graphs directly from API call sequences and leverages the BERT-based model to extract API semantic information from official API documentation. By feeding the semantic information of API nodes and the directed API graphs into the GAT, DawnGNN performs effective Windows malware detection. On three public datasets, we verify that our BERT-based encoding mechanism improves the detection performance compared with one-hot and Word2Vec-based encoding mechanisms, and that DawnGNN outperforms other traditional detection methods that use only raw API call sequences. In addition, we find that the official API documentation is an unexplored informative source and that the BERT-based documentation-augmented mechanism is promising in Windows malware detection.

CRediT authorship contribution statement

Pengbin Feng: Methodology, Software, Writing – original draft, Investigation. Le Gai: Software, Visualization, Writing – original draft. Li Yang: Data curation, Software. Qin Wang: Methodology, Writing – original draft, Software. Teng Li: Methodology, Writing – review & editing. Ning Xi: Supervision, Writing – review & editing. Jianfeng Ma: Funding acquisition, Supervision, Writing – review & editing.

Declaration of competing interest

The authors declare that there are no conflicts of interest regarding the publication of this article. All authors have contributed to, read, and approved this submitted manuscript in its current form.

Data availability

Data will be made available on request.

Acknowledgements

This research was funded by the Major Research plan of the National Natural Science Foundation of China (Grant No. 92267204), the Natural Science Basic Research Program of Shaanxi (Program No. 2023-JC-QN-0759, 2022JM-338), and the Fundamental Research Funds for the Central Universities (Project No.: XJSJ23184).

References

[1] SonicWall, Mid-year update to the 2023 sonicwall cyber threat report, https://www.sonicwall.com/2023-mid-year-cyber-threat-report/. (Accessed 23 January 2024).
[2] J. Singh, J. Singh, Detection of malicious software by analyzing the behavioral artifacts using machine learning algorithms, Inf. Softw. Technol. 121 (2020) 106273.
[3] Z. Sun, Z. Rao, J. Chen, R. Xu, D. He, H. Yang, J. Liu, An opcode sequences analysis method for unknown malware detection, in: Proceedings of the 2019 2nd International Conference on Geoinformatics and Data Analysis, 2019, pp. 15–19.
[4] D. Yuxin, Z. Siyi, Malware detection based on deep learning algorithm, Neural Comput. Appl. 31 (2019) 461–472.
[5] W. Han, J. Xue, Y. Wang, L. Huang, Z. Kong, L. Mao, Maldae: detecting and explaining malware based on correlation and fusion of static and dynamic characteristics, Comput. Secur. 83 (2019) 208–233.
[6] Z. Salehi, A. Sami, M. Ghiasi, Maar: robust features to detect malicious activity based on api calls, their arguments and return values, Eng. Appl. Artif. Intell. 59 (2017) 93–102.
[7] E. Amer, I. Zelinka, A dynamic windows malware detection and prediction method based on contextual understanding of api call sequence, Comput. Secur. 92 (2020) 101760.
[8] D. Uppal, R. Sinha, V. Mehra, V. Jain, Malware detection and classification based on extraction of api sequences, in: 2014 International Conference on Advances in Computing, Communications and Informatics (ICACCI), IEEE, 2014, pp. 2337–2342.
[9] B. Kolosnjaji, A. Zarras, G. Webster, C. Eckert, Deep learning for classification of malware system call sequences, in: AI 2016: Advances in Artificial Intelligence: 29th Australasian Joint Conference, Hobart, TAS, Australia, December 5-8, 2016, Proceedings, vol. 29, Springer, 2016, pp. 137–149.
[10] Z. Zhang, P. Qi, W. Wang, Dynamic malware analysis with feature engineering and feature learning, Proc. AAAI Conf. Artif. Intell. 34 (01) (2020) 1210–1217.
[11] M. Fan, J. Liu, X. Luo, K. Chen, Z. Tian, Q. Zheng, T. Liu, Android malware familial classification and representative sample selection via frequent subgraph analysis, IEEE Trans. Inf. Forensics Secur. 13 (8) (2018) 1890–1905.
[12] Z. Lin, F. Xiao, Y. Sun, Y. Ma, C.-C. Xing, J. Huang, A secure encryption-based malware detection system, KSII Trans. Int. Inf. Syst. 12 (4) (2018) 1799–1818.
[13] F.O. Catak, A.F. Yazı, O. Elezaj, J. Ahmed, Deep learning based sequential model for malware analysis using windows exe api calls, PeerJ Comput. Sci. 6 (2020) e285.
[14] C. Li, Q. Lv, N. Li, Y. Wang, D. Sun, Y. Qiao, A novel deep framework for dynamic malware detection based on api sequence intrinsic features, Comput. Secur. 116 (2022) 102686.
[15] X. Chen, Z. Hao, L. Li, L. Cui, Y. Zhu, Z. Ding, Y. Liu, Cruparamer: learning on parameter-augmented api sequences for malware detection, IEEE Trans. Inf. Forensics Secur. 17 (2022) 788–803.
[16] X. Chen, Y. Tong, C. Du, Y. Liu, Z. Ding, Q. Ran, Y. Zhang, L. Cui, Z. Hao, Malpro: learning on process-aware behaviors for malware detection, in: 2022 IEEE Symposium on Computers and Communications (ISCC), IEEE, 2022, pp. 01–07.
[17] C. Li, Z. Cheng, H. Zhu, L. Wang, Q. Lv, Y. Wang, N. Li, D. Sun, Dmalnet: dynamic malware analysis based on api feature engineering and graph learning, Comput. Secur. 122 (2022) 102872.
[18] Z. Ding, H. Xu, Y. Guo, L. Yan, L. Cui, Z. Hao, Mal-bert-gcn: malware detection by combining bert and gcn, in: 2022 IEEE International Conference on Trust, Security and Privacy in Computing and Communications (TrustCom), IEEE, 2022, pp. 175–183.
[19] H. Gao, S. Cheng, W. Zhang, Gdroid: Android malware detection and classification with graph convolutional network, Comput. Secur. 106 (2021) 102264.
[20] J.K. Siow, S. Liu, X. Xie, G. Meng, Y. Liu, Learning program semantics with code representations: an empirical study, in: 2022 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER), IEEE, 2022, pp. 554–565.
[21] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: pre-training of deep bidirectional transformers for language understanding, arXiv preprint, arXiv:1810.04805, 2018.


[22] J. Samhi, M. Kober, A.K. Kabore, S. Arzt, T.F. Bissyandé, J. Klein, Negative results of fusing code and documentation for learning to accurately identify sensitive source and sink methods: an application to the Android framework for data leak detection, in: 2023 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER), IEEE, 2023, pp. 783–794.
[23] statcounter GlobalStats, Desktop operating system market share worldwide, https://gs.statcounter.com/os-market-share/desktop/worldwide. (Accessed 1 September 2023).
[24] G.H. Andreas Marx, Oliver Marx, Total amount of malware and pua, https://portal.av-atlas.org/malware. (Accessed 1 August 2023).
[25] D. Uppal, R. Sinha, V. Mehra, V. Jain, Exploring behavioral aspects of api calls for malware identification and categorization, in: 2014 International Conference on Computational Intelligence and Communication Networks, IEEE, 2014, pp. 824–828.
[26] P. Maniriho, A.N. Mahmood, M.J.M. Chowdhury, Api-maldetect: automated malware detection framework for windows based on api calls and deep learning techniques, J. Netw. Comput. Appl. (2023) 103704.
[27] C. Guarnieri, Automated malware analysis, https://cuckoosandbox.org/. (Accessed 27 August 2021).
[28] F.O. Catak, Windows malware dataset with pe api calls, https://github.com/ocatak/malware_api_class. (Accessed 27 August 2021).
[29] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio, Y. Bengio, Graph attention networks, arXiv preprint arXiv:1710.10903, 2017.
[30] H. Cai, V.W. Zheng, K.C.-C. Chang, A comprehensive survey of graph embedding: problems, techniques, and applications, IEEE Trans. Knowl. Data Eng. 30 (9) (2018) 1616–1637.
[31] N. Allan, J. Ngubiri, Windows pe api calls for malicious and benigin programs, 09.2019.
[32] Y. Ki, E. Kim, H.K. Kim, A novel approach to detect malware based on api call sequence analysis, Int. J. Distrib. Sens. Netw. 11 (6) (2015) 659101.
[33] Z. Liu, W. Lin, Y. Shi, J. Zhao, A robustly optimized bert pre-training approach with post-training, in: China National Conference on Chinese Computational Linguistics, Springer, 2021, pp. 471–484.
[34] N. Reimers, I. Gurevych, Sentence-bert: sentence embeddings using Siamese bert-networks, arXiv preprint arXiv:1908.10084, 2019.
[35] M. Ring, D. Schlör, S. Wunderlich, D. Landes, A. Hotho, Malware detection on windows audit logs using lstms, Comput. Secur. 109 (2021) 102389.
[36] E.B. Karbab, M. Debbabi, Maldy: portable, data-driven malware detection using natural language processing and machine learning techniques on behavioral analysis reports, Digit. Investig. 28 (2019) S77–S87.
[37] T.K. Tran, H. Sato, Nlp-based approaches for malware classification from api sequences, in: 2017 21st Asia Pacific Symposium on Intelligent and Evolutionary Systems (IES), IEEE, 2017, pp. 101–105.
[38] A. Brown, M. Gupta, M. Abdelsalam, Automated machine learning for deep learning based malware detection, arXiv preprint arXiv:2303.01679, 2023.
[39] J. Geng, J. Wang, Z. Fang, Y. Zhou, D. Wu, W. Ge, A survey of strategy-driven evasion methods for pe malware: transformation, concealment, and attack, Comput. Secur. (2023) 103595.
[40] W. You, Z. Zhang, Y. Kwon, Y. Aafer, F. Peng, Y. Shi, C. Harmon, X. Zhang, Pmp: cost-effective forced execution with probabilistic memory pre-planning, in: 2020 IEEE Symposium on Security and Privacy (SP), IEEE, 2020, pp. 1121–1138.
[41] F. Barr-Smith, X. Ugarte-Pedrero, M. Graziano, R. Spolaor, I. Martinovic, Survivalism: systematic analysis of windows malware living-off-the-land, in: 2021 IEEE Symposium on Security and Privacy (SP), IEEE, 2021, pp. 1557–1574.
[42] M.A. Talib, Q. Nasir, A.B. Nassif, T. Mokhamed, N. Ahmed, B. Mahfood, Apt beaconing detection: a systematic review, Comput. Secur. (2022) 102875.
[43] K. Aryal, M. Gupta, M. Abdelsalam, A survey on adversarial attacks for malware analysis, arXiv preprint arXiv:2111.08223, 2021.
[44] J. Li, X. Fu, S. Zhu, H. Peng, S. Wang, Q. Sun, S.Y. Philip, L. He, A robust and generalized framework for adversarial graph embedding, IEEE Trans. Knowl. Data Eng. (2023).
[45] M. Gupta, C. Akiri, K. Aryal, E. Parker, L. Praharaj, From chatgpt to threatgpt: impact of generative ai in cybersecurity and privacy, IEEE Access (2023).
[52] R. Agrawal, J.W. Stokes, M. Marinescu, K. Selvaraj, Neural sequential malware detection with parameters, in: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2018, pp. 2656–2660.
[53] D. Rabadi, S.G. Teo, Advanced windows methods on malware detection and classification, in: Annual Computer Security Applications Conference, 2020, pp. 54–68.
[54] Y. Qiao, Y. Yang, L. Ji, J. He, Analyzing malware by abstracting the frequent itemsets in api call sequences, in: 2013 12th IEEE International Conference on Trust, Security and Privacy in Computing and Communications, IEEE, 2013, pp. 265–270.
[55] G.E. Dahl, J.W. Stokes, L. Deng, D. Yu, Large-scale malware classification using random projections and neural networks, in: 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, IEEE, 2013, pp. 3422–3426.
[56] B. Ndibanje, K.H. Kim, Y.J. Kang, H.H. Kim, T.Y. Kim, H.J. Lee, Cross-method-based analysis and classification of malicious behavior by api calls extraction, Appl. Sci. 9 (2) (2019) 239.
[57] X. Zhang, Y. Zhang, M. Zhong, D. Ding, Y. Cao, Y. Zhang, M. Zhang, M. Yang, Enhancing state-of-the-art classifiers with api semantics to detect evolved Android malware, in: Proceedings of the 2020 ACM SIGSAC Conference on Computer and Communications Security, 2020, pp. 757–770.
[58] R. Pascanu, J.W. Stokes, H. Sanossian, M. Marinescu, A. Thomas, Malware classification with recurrent networks, in: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2015, pp. 1916–1920.
[59] H. Jiang, T. Turki, J.T. Wang, Dlgraph: malware detection using deep learning and graph embedding, in: 2018 17th IEEE International Conference on Machine Learning and Applications (ICMLA), IEEE, 2018, pp. 1029–1033.
[60] B. Souani, A. Khanfir, A. Bartel, K. Allix, Y. Le Traon, Android malware detection using bert, in: International Conference on Applied Cryptography and Network Security, Springer, 2022, pp. 575–591.
[61] S. Jeon, G. Lee, H. Kim, S.S. Woo, Smartcondetect: highly accurate smart contract code vulnerability detection mechanism using bert, in: KDD Workshop on Programming Language Processing, 2021.
[62] X. Lin, G. Xiong, G. Gou, Z. Li, J. Shi, J. Yu, Et-bert: a contextualized datagram representation with pre-training transformers for encrypted traffic classification, in: Proceedings of the ACM Web Conference 2022, 2022, pp. 633–642.
[63] A. Rahali, M.A. Akhloufi, Malbert: malware detection using bidirectional encoder representations from transformers, in: 2021 IEEE International Conference on Systems, Man, and Cybernetics (SMC), IEEE, 2021, pp. 3226–3231.
[64] S. Ahn, S. Ahn, H. Koo, Y. Paek, Practical binary code similarity detection with bert-based transferable similarity learning, in: Proceedings of the 38th Annual Computer Security Applications Conference, 2022, pp. 361–374.
[65] L. Deng, H. Wen, M. Xin, H. Li, Z. Pan, L. Sun, Enimanal: augmented cross-architecture iot malware analysis using graph neural networks, Comput. Secur. (2023) 103323.

Pengbin Feng received the Ph.D. degree in computer science from Xidian University, Xi’an, Shaanxi, China in 2019. He is currently a lecturer in the School of Cyber Engineering, Xidian University. His research interests include malware detection and binary analysis.

Le Gai is currently a BSc student with the School of Computer Science and Technology at Xidian University, China. His current research interests include learning-based security detection and privacy protection.

Li Yang received the Ph.D. degree in computer science from Xidian University, Xi’an, Shaanxi, China in 2010. He is currently a Professor with the School of Computer Science & Technology, Xidian University. His current research interests include wireless network and system security.

Qin Wang received the Ph.D. degree from the University of New South Wales in 2022. He is now an assistant researcher at the University of New South Wales. His research interests include privacy protection and blockchain.
Teng Li received the Ph.D. degrees in computer science from Xidian University, Xi’an,
[46] Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, R. Soricut, Albert: a lite bert
Shaanxi, China in 2018. He is currently an Associate Professor in the School of Cyber En-
for self-supervised learning of language representations, arXiv preprint arXiv:1909. gineering, Xidian University. His current research interests include wireless and networks,
11942, 2019. distributed systems and intelligent terminals with focus on security and privacy issues.
[47] V. Sanh, L. Debut, J. Chaumond, T. Wolf, Distilbert, a distilled version of bert:
smaller, faster, cheaper and lighter, arXiv preprint arXiv:1910.01108, 2019.
Ning Xi received the Ph.D. degrees in computer science from Xidian University, Xi’an,
[48] K. Xu, C. Li, Y. Tian, T. Sonobe, K.-i. Kawarabayashi, S. Jegelka, Representation
Shaanxi, China in 2014. He is currently a Professor with the School of Cyber Engineering,
learning on graphs with jumping knowledge networks, in: International Conference
Xidian University. His research interests include home network, service computing, and
on Machine Learning, in: PMLR, 2018, pp. 5453–5462. network security.
[49] Y. Liu, M. Jin, S. Pan, C. Zhou, Y. Zheng, F. Xia, S.Y. Philip, Graph self-supervised
learning: a survey, IEEE Trans. Knowl. Data Eng. 35 (6) (2022) 5879–5900.
[50] G. Dong, M. Tang, Z. Wang, J. Gao, S. Guo, L. Cai, R. Gutierrez, B. Campbel, L.E. Jianfeng Ma received the Ph.D. degree in computer software and telecommunication
engineering from Xidian University, Xi’an, China, in 1995. He is currently a Professor and
Barnes, M. Boukhechba, Graph neural networks in iot: a survey, ACM Trans. Sens.
a Ph.D. Supervisor with the Department of Computer Science and Technology, Xidian
Netw. 19 (2) (2023) 1–50.
University. He is also the Director of the Shaanxi Key Laboratory of Network and System
[51] Y. Fang, B. Yu, Y. Tang, L. Liu, Z. Lu, Y. Wang, Q. Yang, A new malware classifi-
Security. His current research interests include information and network security, wireless
cation approach based on malware dynamic analysis, in: Information Security and and mobile computing systems, and computer networks.
Privacy: 22nd Australasian Conference, ACISP 2017, Auckland, New Zealand, July
3–5, 2017, Proceedings, Part II, vol. 22, Springer, Auckland, New Zealand, 2017,
pp. 173–189.
