
Enhancing Reverse Engineering: Investigating and Benchmarking Large Language Models for Vulnerability Analysis in Decompiled Binaries

Dylan Manuel 1,2, Nafis Tanveer Islam 1,2, Joseph Khoury 3, Ana Nunez 1,2, Elias Bou-Harb 3, Peyman Najafirad 1,2
1 Secure AI and Autonomy Laboratory
2 University of Texas at San Antonio
3 Louisiana State University

arXiv:2411.04981v1 [cs.CR] 7 Nov 2024

Abstract

Security experts reverse engineer (decompile) binary code to identify critical security vulnerabilities. The limited access to source code in vital systems – such as firmware, drivers, and proprietary software used in Critical Infrastructures (CI) – makes this analysis even more crucial on the binary level. Even with available source code, a semantic gap persists after compilation between the source and the binary code executed by the processor. This gap may hinder the detection of vulnerabilities in source code. That being said, current research on Large Language Models (LLMs) overlooks the significance of decompiled binaries in this area by focusing solely on source code. In this work, we are the first to empirically uncover the substantial semantic limitations of state-of-the-art LLMs when it comes to analyzing vulnerabilities in decompiled binaries, largely due to the absence of relevant datasets. To bridge the gap, we introduce DeBinVul, a novel decompiled binary code vulnerability dataset. Our dataset is multi-architecture and multi-optimization, focusing on C/C++ due to their wide usage in CI and association with numerous vulnerabilities. Specifically, we curate 150,872 samples of vulnerable and non-vulnerable decompiled binary code for the tasks of (i) identifying, (ii) classifying, and (iii) describing vulnerabilities, and (iv) recovering function names in the domain of decompiled binaries. Subsequently, we fine-tune state-of-the-art LLMs using DeBinVul and report a performance increase of 19%, 24%, and 21% in the capabilities of CodeLlama, Llama3, and CodeGen2, respectively, in detecting binary code vulnerabilities. Additionally, using DeBinVul, we report a high performance of 80-90% on the vulnerability classification task. Furthermore, we report improved performance in function name recovery and vulnerability description tasks. All our artifacts are available at https://anonymous.4open.science/r/vuln-decompiled-summarization-8017.

Copyright © 2025, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Introduction

It is crucial to perform vulnerability analysis in software that plays a vital role in shaping Critical Infrastructure (CI) sectors such as water, energy, communications, and defense, to name a few. Despite the many advancements in software security, the reported number of Common Vulnerabilities and Exposures (CVEs) has been increasing annually, from 14,249 in 2022, to 17,114 in 2023, and surging to 22,254 in 2024 (Qualys 2024). These CVEs are correlated with the Common Weakness Enumeration (CWE) categories maintained by MITRE, which provide a baseline for identifying, mitigating, and preventing security weaknesses during source code development. Notably, during compilation and optimization, the source code transitions into binary code, resulting in mismatches and changes in code properties (Eschweiler et al. 2016). This inherently creates a vulnerability semantic discrepancy not addressed by CWEs or other vulnerability categorization systems. As such, vulnerability analysis of source code and binary code remains two distinct and separate areas of research (Mantovani et al. 2022). This phenomenon is succinctly captured by the statement, "What you see is not what you execute" (Balakrishnan and Reps 2010).

Why Decompiled Binary Code Vulnerability Analysis? — Significance & Technical Challenges. Binary code (i.e., binaries/executables) is a fundamental component of computing and digital systems, taking the form of firmware, drivers/agents, and closed-source software. To safeguard these systems, reverse engineers attempt to uncover source code from binary code using decompilation tools such as Ghidra, angr, and IDA Pro, subsequently performing essential vulnerability analysis on decompiled binary code (Burk et al. 2022). This is particularly important for two main reasons: first, access to source code is most of the time limited/restricted for proprietary or security reasons; second, vulnerabilities may not be apparent in the source code, such as those related to the execution environment, operating system, specific compiler optimizations, and hardware specifications. For instance, use-after-free (memory corruption) vulnerabilities, which affect many closed-source system components and network protocols written in C/C++, are known to be one of the most difficult types to identify using source code static analysis (Lee et al. 2015; Nguyen et al. 2020). On a different note, due to the NP-complete nature of the compiler optimization problem (Eschweiler et al. 2016), decompiled binary code loses important constructs, such as structured control flow, complex data structures, variable names, and function signatures (Burk et al. 2022). As a consequence, these setbacks impede
the ability of reverse engineers to analyze vulnerabilities in binary code, necessitating significant manual effort and time investment.

Avant-garde Advancements and Perceived Opportunities. More recently, state-of-the-art Large Language Models (LLMs) have been employed as an optimizer to improve the readability and simplicity of decompilers' output, ultimately reducing the cognitive burden of understanding decompiled binary code for reverse engineers (Hu, Liang, and Chen 2024). Similarly, a cross-modal knowledge prober coupled with LLMs has been utilized to effectively lift the semantic gap between source and binary code (Su et al. 2024). Furthermore, comprehensive benchmarking was conducted on ChatGPT/GPT-4 and other LLMs to evaluate their effectiveness in summarizing the semantics of binary code (Jin et al. 2023). This assessment revealed the transformative capabilities of LLMs in the field while also highlighting key findings on their limitations, which demand further research. While these efforts aim to improve the readability and comprehension of decompiled binary code semantics, they overlook the vulnerability semantic gap between source code and decompiled binary code. To date, no comprehensive research has been conducted to thoroughly investigate and explore the potential of LLMs in decompiled binary code vulnerability analysis. This task remains far from straightforward due to the following two main limitations: (i) lack of real-world decompiled binary code vulnerability datasets; and (ii) the vulnerability semantic gap between source and decompiled binary code in LLMs. Currently, state-of-the-art LLMs are trained on textual-like input, including source code, but they lack semantic knowledge of vulnerabilities in the decompiled binary code domain due to the absence of representative datasets. Through an empirical and pragmatic investigation of the analytical abilities of LLMs, we find a consistently low performance of 67%, 54%, and 33% on decompiled binary code, compared to a slightly higher performance of 75%, 68%, and 45% on source code with GPT4, Gemini, and LLaMa 3, respectively. Table 1 highlights some of the insights we derived from our investigation. More information on the investigation is provided in the Appendix and Table 8 in Section Source & Decompiled Binary Code Vulnerability Semantic Gap: Investigating LLMs' Analytical Abilities. To this end, significant manual effort is required to curate decompiled binary code samples that include relevant vulnerabilities, realistic compilation and decompilation settings, and representative input formats for LLMs. Moreover, this entails optimizing state-of-the-art LLMs through extensive fine-tuning and instruction/prompting techniques.

Table 1: Motivational Investigation: LLMs' semantic gap comparison between static source code and decompiled binary code on the vulnerability classification task. Reported average F1-scores.

Input Type        GPT4     Gemini    CodeLLaMa
Source Code       0.75     0.68      0.64
Dec. Binary Code  0.67 ↓   0.54 ↓    0.54 ↓

                  Mistral  LLaMa 3   CodeGen2
Source Code       0.60     0.45      0.64
Dec. Binary Code  0.54 ↓   0.33 ↓    0.52 ↓

Our Contribution. To tackle these challenges and capitalize on the perceived opportunities, this work aims to ask: Can we enhance reverse engineering by bridging the semantic gap between source and decompiled binary code vulnerability analysis in state-of-the-art LLMs?

To answer this question, we undertake the following quests. Firstly, we empirically investigate the analytical abilities of state-of-the-art LLMs and uncover a vulnerability semantic gap between source and decompiled binary code. Our investigation encompasses real-world code injection in public repositories, simulating an emergent cybersecurity attack that targets the widely recognized Linux-based XZ Utils (Akamai 2023). Secondly, we introduce DeBinVul, a novel decompiled binary vulnerability dataset built with zero-shot prompt engineering. Our dataset comprises relevant non-vulnerable and vulnerable source code samples, tagged with CWE classes, and compiled using Clang and GCC across four different CPU architectures: x86, x64, ARM, and MIPS. During compilation, we applied two levels of optimization: O0 and O3. Then, using Ghidra, we decompile the compiled code to obtain the decompiled binary code samples. Furthermore, we augment our dataset with code descriptions and instruction/prompting techniques. Thirdly, we fine-tune and optimize state-of-the-art LLMs, aiming to enhance their capabilities in assisting reverse engineers in uncovering vulnerabilities in decompiled binary code. In summary, the contributions of this paper are as follows:

• To the best of our knowledge, we are the first to empirically investigate the vulnerability semantic gap between source and decompiled binary code in state-of-the-art LLMs. Our findings highlight the suboptimal performance of these models in performing vulnerability analysis on decompiled binaries.

• We compile and release DeBinVul, a novel decompiled binary code vulnerability dataset comprising 150,872 samples of openly sourced, synthetically generated, and manually crafted corner-case C/C++ code samples. It is designed to tackle four important binary code vulnerability analysis tasks, including vulnerability detection, classification, description, and function name recovery.

• We employ our proposed dataset to fine-tune and enhance the reverse engineering capabilities of a range of state-of-the-art LLMs. Our results show a performance increase of 19%, 24%, and 21% in the capabilities of CodeLlama, Llama 3, and CodeGen2, respectively, in detecting vulnerabilities in binary code.
Proposed Approach

In order to mitigate the challenges faced by LLMs in understanding decompiled binaries and improve their performance in understanding their security impact, we propose an extensive dataset comprised of source code and their decompiled binaries. Further details are provided in the sequel. Figure 1 highlights our entire architecture.

Step 1: Data Collection

We compiled our dataset from three distinct sources: the National Vulnerability Database (NVD), the Software Assurance Reference Dataset (SARD), and a collection of real-world code enhanced with synthetically added vulnerabilities to cover corner cases where real-world code is not available for certain vulnerabilities for proprietary and security reasons.

NVD. NVD provides real-world source code vulnerabilities which were reported by developers and security experts from various repositories. But these are often individual functions, making it difficult to compile them into executables and decompile them due to unclear documentation and library dependencies. Therefore, during collection, we had to skip the ones which were not compilable. Additionally, the NVD often lacks coverage of preparatory vulnerabilities, such as those involving specific configurations or security mechanism bypasses, which don't lead directly to exploits but can set the stage for more severe issues. While the information on the vulnerabilities is exposed, the source code is often not exposed to the NVD since it may contain sensitive information like the code or data structure of the system, file, or operating system. These preparatory or indirect vulnerabilities are often not published in the NVD, as they require specific, often complex conditions to manifest in a real-world exploit.

SARD. SARD is a valuable resource for the software security community. It is a curated collection of programs containing deliberate vulnerabilities. Researchers and tool developers use SARD to benchmark their security analysis tools, identifying strengths and weaknesses by exposing the tools to a wide range of known vulnerabilities. While SARD provides code examples with known vulnerabilities and all of its code samples are executable, it often lacks real-world complexity and diversity.

Vulnerability Injection. Furthermore, we proposed an innovative automatic injection process to inject vulnerabilities into the source of real-world repositories to emulate this scenario. Table 9 briefly describes the various LLMs we analyze for vulnerability. After getting the repositories, we injected vulnerabilities into randomly selected functions from the repositories by prompt engineering using LLMs. We selected 8 of the top 25 CWE vulnerabilities from MITRE that are common in C/C++ programs. Out of the initial 500 randomly selected code samples, we injected vulnerabilities into 462, of which 38 were not compilable and were subsequently ignored. We provide the details of the repository selection and vulnerability injection process in the Appendix (Vulnerability Injection Process).

Combining the SARD and the NVD for training LLMs can significantly enhance their capabilities in vulnerability analysis. While these two datasets offer either fully synthetic or fully real vulnerabilities, our method of injecting vulnerabilities tries to overcome the issues we see in the NVD. Together, these datasets allow the LLM to generalize from structured, annotated examples to broader, complex, real-world scenarios, resulting in a more versatile model that can analyze decompiled binaries across various software contexts. Hence, we opt to construct the DeBinVul dataset by extending the capabilities of MVD to analyze decompiled binaries with instructions that align with LLMs.

Step 2: Data Processing and Increase

Function Extraction. We developed a methodology using the Tree-Sitter parser to extract the functions from a file. Since we are extracting C/C++ functions, we use the C or C++ grammar for function extraction: if the file name suffix is ".c", the C grammar is used; otherwise, if the suffix is ".cpp", the C++ grammar is used. The functions available from SARD have special signatures in the function declaration which help us determine whether the function is vulnerable or not. For example, if the function name contains the term "good" or "bad", we consider it a non-vulnerable or vulnerable function, respectively. Furthermore, if the function is vulnerable, the function name also contains the CWE number. We utilize this information to annotate the vulnerable and non-vulnerable functions and find the CWE numbers of the vulnerable functions using regular expressions. The code from the NVD and from our injection technique already consists of individual functions; therefore, it does not need to be extracted.

Compilation and Optimization. After we have extracted the vulnerable and non-vulnerable functions, we locate the necessary header files and other source code files needed to compile the CWE file. These files are conveniently located in the same directory as the CWE file. Each source code function was compiled six times to ensure comprehensive analysis, resulting in six binaries of a single function. This process involved using two compilers, two optimization levels, and four architectures.

Decompilation. We used Ghidra (NSA 2019) to decompile the functions. Decompilation can usually be done in two ways: stripping or not stripping the function and variable names. In real-world applications, the function and variable names are stripped for security reasons. Therefore, we emulate the same process by stripping function and variable names using the special flag -s during compilation.
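To make the compilation and stripping steps concrete, the following is a minimal sketch of how one build-and-decompile pass could be driven from Python. It assumes a GCC-style toolchain on the host and Ghidra's headless analyzer; the cross-compiler names, project paths, and the "ExportDecompiled.py" post-script are illustrative assumptions, not the paper's actual build scripts.

```python
import subprocess
from pathlib import Path

# Illustrative toolchains per target architecture (placeholders; the paper
# reports using GCC and Clang across x86, x64, ARM, and MIPS).
COMPILERS = {"x64": "gcc", "arm": "arm-linux-gnueabi-gcc", "mips": "mips-linux-gnu-gcc"}
OPT_LEVELS = ["-O0", "-O3"]

def build_variants(src_file: Path, out_dir: Path):
    """Compile one CWE source file at both optimization levels, stripping symbols."""
    out_dir.mkdir(parents=True, exist_ok=True)
    binaries = []
    for arch, cc in COMPILERS.items():
        for opt in OPT_LEVELS:
            out = out_dir / f"{src_file.stem}_{arch}_{opt.lstrip('-')}"
            # -s strips the symbol table so function/variable names are removed,
            # emulating the real-world stripped binaries described in Step 2.
            cmd = [cc, opt, "-s", "-I", str(src_file.parent), str(src_file), "-o", str(out)]
            subprocess.run(cmd, check=True)
            binaries.append(out)
    return binaries

def decompile_with_ghidra(binary: Path, ghidra_home: Path, project_dir: Path):
    """Invoke Ghidra's headless analyzer; 'ExportDecompiled.py' is a hypothetical post-script."""
    cmd = [
        str(ghidra_home / "support" / "analyzeHeadless"),
        str(project_dir), "debinvul_proj",
        "-import", str(binary),
        "-postScript", "ExportDecompiled.py",
        "-deleteProject",
    ]
    subprocess.run(cmd, check=True)
```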
Figure 1: Our Proposed Approach: an overview of our proposed instruct dataset DeBinVul, with a sample example comprising a decompiled binary code input and a list of questions (instructions) and answers. Subsequently, using DeBinVul, we train state-of-the-art LLM models to optimize them and elevate their capabilities in assisting reverse engineers in unveiling vulnerabilities in binary code. (Panels: Step 1: Data Collection; Step 2: Data Processing and Increase; Step 3: Create the Decompiled Binary Code Vulnerability (DeBinVul) Instruct-based Dataset; Step 4: Fine-Tune state-of-the-art LLMs with DeBinVul to Enhance Reverse Engineering and Vulnerability Analysis in Binary Code.)
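For illustration, one DeBinVul record of the kind sketched in Figure 1 could be serialized as below. The field names, the classification/description instructions, and the CWE label are hypothetical (the released dataset defines its own schema); the identification and function-name instructions and answers are taken from the figure.

```python
# Hypothetical DeBinVul record layout (field names are illustrative).
sample = {
    "decompiled_code": "bool 0x818042DOE(long param_1, int param_2) { ... }",
    "label": "vulnerable",   # target for the identification task
    "cwe": "CWE-121",        # assumed label for the classification task
    "instructions": [
        {"task": "identify",
         "instruction": "As a specialist in code security, assess this decompiled code. "
                        "Is there any vulnerability (Yes/No)?",
         "answer": "Yes"},
        {"task": "recover_name",
         "instruction": "Your task is to identify the function name from this decompiled code.",
         "answer": "JsonAllAlphaNum"},
        {"task": "classify",
         "instruction": "Which CWE category does this decompiled function fall under?",
         "answer": "CWE-121"},
        {"task": "describe",
         "instruction": "Describe the objective of this decompiled function and its weakness.",
         "answer": "..."},
    ],
}
```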

Description. The functions we extracted mainly contain comments written by software security experts. However, these comments are partial and explain only a particular statement, and there are multiple levels of comments for some important vulnerability-prone lines. Therefore, we use Tree-Sitter to create a method that defines comments in C/C++. Then, we use this definition to extract the comments inside these functions. Finally, we take the source code without the comments together with the extracted comments and prompt GPT-4 to write a clean, comprehensive description of the code. Furthermore, we also want to ensure that we use the same description when we use decompiled functions. Therefore, we ensure that function and variable names are not present when describing the function objectives and vulnerabilities.

Step 3: Instructions

We provide an instruction-based dataset, enabling the user or developer to use our system by providing instructions with code. Therefore, we created four types of robust instructions. We created four carefully curated prompts to instruct GPT-4 to create 20 instructions for each task; therefore, we have 80 instructions. Moreover, we provided 2 sample examples with each prompt to guide GPT-4 toward generating the most appropriate instructions. These instructions are manually appended to the input code during training and testing based on the desired task. Table 14 shows the prompts we used to generate the 20 instructions for each task. The instructions generated from these prompts are available in our repository. We provide more details on our data preparation in Section DeBinVul Dataset Preparation in the Appendix.

Step 4: Fine-Tuning Process

Tokenization of Decompiled Code. We use a byte-pair encoding (BPE) tokenizer, common in natural language and programming language processing, to efficiently manage large vocabularies by merging frequent byte or character pairs into tokens. This approach reduces vocabulary size while preserving common patterns, balancing granularity and efficiency for handling diverse language data. From each function f, we extract a set of tokens T, trimming the input size to 512 tokens. We also add the special tokens <BOS> and <EOS> at the start and end of the program, respectively, and pad sequences shorter than 32,000 tokens with <PAD>. The tokenized decompiled code is then used as input for the model.

Model Training and Optimization. In this work, we explore the application of generative language models to four tasks: i) vulnerability identification, ii) vulnerability classification, iii) function name prediction, and iv) description generation. Although the first two tasks are typically classification tasks (binary and multiclass, respectively), we convert all four tasks into generative ones by leveraging our model's instruction-following capability. Specifically, the model outputs "Yes/No" for vulnerability identification, generates a "CWE-XXX" code for classification, predicts a single token for the function name, and produces multiple tokens for description generation, enabling a unified multitask approach.
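A minimal sketch of how the tokenization and instruction-tuning setup described in Step 4 could be wired up with the Hugging Face libraries, using the hyperparameters reported later in the Experimental Setup (four epochs, learning rate 2e-5, batch size 4, 512-token inputs). The checkpoint name, field names, and pad-token handling are assumptions; the paper does not reproduce its training script here.

```python
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL = "codellama/CodeLlama-7b-hf"  # assumption: any of the evaluated 7B checkpoints

tokenizer = AutoTokenizer.from_pretrained(MODEL)
# Most BPE tokenizers already ship BOS/EOS tokens; only a pad token may need adding.
if tokenizer.pad_token is None:
    tokenizer.add_special_tokens({"pad_token": "<PAD>"})

model = AutoModelForCausalLM.from_pretrained(MODEL)
model.resize_token_embeddings(len(tokenizer))

def encode(example):
    """Concatenate instruction + decompiled code, then trim/pad to 512 tokens."""
    text = (f"{tokenizer.bos_token}{example['instruction']}\n"
            f"{example['decompiled_code']}{tokenizer.eos_token}")
    enc = tokenizer(text, truncation=True, max_length=512, padding="max_length")
    enc["labels"] = enc["input_ids"].copy()  # causal-LM objective over the sequence
    return enc

args = TrainingArguments(
    output_dir="debinvul-ft",
    num_train_epochs=4,
    learning_rate=2e-5,
    per_device_train_batch_size=4,
)
# trainer = Trainer(model=model, args=args, train_dataset=train_ds.map(encode))
# trainer.train()
```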
Evaluation

In this section, we evaluate the effectiveness of our proposed dataset DeBinVul by benchmarking state-of-the-art LLMs on it and comparing their performance on the test set before and after fine-tuning. We evaluate our proposed system to answer the following Research Questions (RQs):

RQ1: Using our instruction-based dataset DeBinVul, how effectively can different LLMs identify and classify vulnerabilities in binaries?
RQ2: How do the models trained with our dataset perform in function name prediction and description generation?
RQ3: Are the current LLMs generalized enough to analyze vulnerabilities in architectures and optimization levels beyond those present in their training dataset?

Evaluation Metrics

Our evaluation uses various task-specific metrics. We used accuracy, precision, recall, and F1 scores for the vulnerability identification and detection tasks in decompiled code. Acc.V refers to accuracy when all the input functions are vulnerable, and Acc.B refers to accuracy when all the input functions to the model are benign or non-vulnerable. We rely on BLEU (B.), Rouge-L (R.L), BERTScore (B.Score), and Semantic Similarity (Sim.) for the function name prediction and description generation tasks. We provide more details on the evaluation metrics in Section Evaluation Metrics in the Appendix.

Experimental Analysis

Experimental Setup. For our evaluations, we split our DeBinVul dataset into 80% training, 10% validation, and 10% testing. The training data included source code from the NVD dataset up to December 2021 to ensure that test data always followed the training data chronologically. We trained all benchmark models on an NVIDIA DGX server with an AMD EPYC 7742 64-core processor, 1 TB of RAM, and 8 NVIDIA A100 GPUs. The models were trained for four epochs with a maximum token length of 512, a learning rate of 2e-5, and a batch size of 4 for our 7B-parameter models. A beam size of 1 and a temperature value of 1.0 were used for the generation task.

Table 2: RQ1: Vulnerability identification task comparison between state-of-the-art LLMs vs. those trained on our dataset DeBinVul, referred to as DBVul in the table.

Model       Training  Acc   Pre.  Rec.  F1    Acc.V  Acc.B
CodeLLaMa   -         0.56  0.6   0.78  0.68  0.78   0.23
            DBVul     0.85  0.89  0.86  0.87  0.86   0.84
CodeGen2    -         0.59  0.65  0.83  0.73  0.83   0.13
            DBVul     0.91  0.93  0.94  0.94  0.94   0.86
Mistral     -         0.48  0.71  0.42  0.53  0.42   0.61
            DBVul     0.89  0.95  0.88  0.91  0.88   0.9
StarCoder   -         0.59  0.6   0.97  0.74  0.97   0.01
            DBVul     0.89  0.91  0.93  0.92  0.93   0.80
LLaMa 3     -         0.57  0.7   0.68  0.69  0.68   0.34
            DBVul     0.91  0.94  0.93  0.93  0.93   0.87

Table 3: RQ1: Vulnerability classification task (F1 score) comparison between base LLMs vs. those trained on our dataset DeBinVul, referred to as DBVul in the table.

        C.LLaMa  CodeGen2  Mistral  LLaMa3  St.Coder
Base    0.04     0         0.04     0.02    0.03
DBVul   0.81↑    0.85↑     0.83↑    0.9↑    0.84↑

Answering Research Question 1

In answering RQ1, we investigate the effectiveness of the proposed dataset in analyzing binary code for four tasks, namely i) vulnerability identification, ii) vulnerability classification, iii) function name prediction, and iv) description of code objective. Throughout all our RQs, we use Accuracy, Precision, Recall, and F1 scores for vulnerability identification and classification, and BLEU, Rouge-L, BERTScore, and Cosine Similarity for function name prediction and description generation. To answer RQ1, we only used O0 optimization on the x86 architecture. Table 2 shows the baseline comparison on the identification task on binary code. In the Training column, "-" indicates results before training the model, and DBVul denotes results after the LLM was fine-tuned with our dataset. Overall, all the LLMs, when trained with our proposed dataset, show an improvement in F1 score of 18% or higher. Without training, CodeGen2 and StarCoder perform best, identifying vulnerabilities with 59% accuracy; however, since this is a binary task, this is very close to a randomized guess, which is approximately 50%. Moreover, if we look at the individual accuracy on only vulnerable and only non-vulnerable (benign) code, we can see that some models like CodeGen2 (Nijkamp et al. 2023), StarCoder (Li et al. 2023b), and CodeLLaMa (Roziere et al. 2023) have significantly lower accuracy (StarCoder: 70% lower) in identifying the non-vulnerable or benign functions while maintaining a higher accuracy in identifying the vulnerable functions. This suggests that these models prefer to label most functions as vulnerable, hence the identification imbalance. However, after all the models were individually trained on our proposed dataset, we see an overall increase in accuracy and F1 score, and CodeGen2 and LLaMa 3 top this task with an accuracy of 91%, which is almost a 30% improvement over the baseline models. Furthermore, looking at the accuracy on only vulnerable and only benign functions, we see that, for both cases, the performance remains high: CodeGen2 is 94% successful at finding the vulnerable functions and LLaMa 3 is 87% successful at finding the non-vulnerable or benign functions. For classification, in Table 3, we show the F1 score comparison. We can see that all the base models have a classification F1 score of less than 5%, and interestingly, while CodeGen2 is a code-based LLM, it shows a score of 0 (zero) for vulnerability classification. Table 15 compares all the CWEs in the different models more in depth. We provide more details on the classification task in Subsection Further Discussion on RQ1 in the Appendix.

Answering Research Question 2

Our aim in answering RQ2 is to analyze one of the top-performing models to understand its performance across different architectures. Hence, we selected CodeLLaMa for this task to analyze the vulnerability of decompiled code. Here, we again train the base model on the same four tasks we performed in RQ1. However, RQ2 differs from RQ1 in using a multi-architecture compilation of source code into decompiled code.
For identification, in Figure 2, we see that the performance is close to approximately 90% when we test by combining all the architectures. However, we see an improvement in Precision, F1, and accuracy on non-vulnerable or benign functions for MIPS. Moreover, we see a significant performance drop of 2-3% across overall metrics for the x64 architecture, whereas the performance of x86, ARM, and MIPS remains relatively similar. Similarly, we see mixed results on the F1 score for multiclass classification of CWEs in Figure 3. For example, on CWE-121, CWE-122, CWE-427, CWE-665, CWE-758, and CWE-789, MIPS performs the highest. However, for CWE-401, CWE-690, and CWE-761, we see a relatively stable performance across all architectures. An interesting observation from Figure 3 is that, for CWE-666, the F1 score goes down to zero, which implies a limitation of our dataset on CWE-666. If we follow the trend line of the moving average for "All Architecture", we see that, overall, the model performs lower for CWE-126, CWE-617, CWE-666, and CWE-78 while maintaining good performance on the other CWEs.

Figure 2: RQ2: Comparison of the identification task on different architectures on decompiled binaries (Accuracy, Precision, Recall, F1, Accuracy (Vul), and Accuracy (Benign) for all architectures combined, x86, ARM, MIPS, and x64).

For the task of function name prediction and description generation in Table 5, the Cosine Similarity score shows a lower performance of 78% on x64, while MIPS and the combination of all architectures show a 4% higher score of 82% on this task. For the description generation of decompiled code, we see a more stable score, where ARM, MIPS, and x64 top out at a 76% similarity score, whereas x86 shows a merely 2% lower performance of 74%.

Table 4: RQ1: Performance of LLMs on the Function Name Prediction and Description Generation tasks.

Task: Function Name Prediction
Model       Train   B.    R.L.  B.Score Prec.  B.Score F1  Sim
CodeLLaMa   Our DS  0.65  0.75  0.96           0.96        0.81
            Base    0.00  0.00  0.77           0.78        0.10
Mistral     Our DS  0.62  0.69  0.95           0.95        0.76
            Base    0.00  0.01  0.78           0.79        0.40
CodeGen2    Our DS  0.64  0.72  0.96           0.96        0.79
            Base    0.02  0.01  0.78           0.79        0.15
StarCoder   Our DS  0.53  0.69  0.94           0.95        0.79
            Base    0.00  0.00  0.78           0.79        0.40
LLaMa 3     Our DS  0.66  0.77  0.97           0.97        0.83
            Base    0.01  0.01  0.78           0.79        0.34

Task: Description
Model       Train   B.    R.L.  B.Score Prec.  B.Score F1  Sim
CodeLLaMa   Our DS  0.12  0.29  0.89           0.88        0.77
            Base    0.01  0.17  0.78           0.79        0.39
Mistral     Our DS  0.13  0.30  0.89           0.88        0.78
            Base    0.03  0.18  0.80           0.81        0.48
CodeGen2    Our DS  0.11  0.28  0.88           0.88        0.77
            Base    0.00  0.02  0.76           0.77        0.18
StarCoder   Our DS  0.09  0.25  0.83           0.85        0.71
            Base    0.00  0.02  0.76           0.77        0.18
LLaMa 3     Our DS  0.13  0.30  0.89           0.88        0.78
            Base    0.02  0.18  0.83           0.83        0.50

Answering Research Question 3

Our goal in answering RQ3 is to demonstrate how well our CodeLLaMa performs when trained on a subset of architectures and tested on a different subset of architectures for the function name prediction and description generation tasks. In Table 6, the "Train" column depicts the set of architectures that were present during training and the "Test" column defines the set of architectures that were present during testing. However, we kept some overlap in the architectures between training and testing for comparison. All − x64 means that the model was trained on all architectures except x64, and ARM + x86 means that the model was only trained on the ARM and x86 architectures. For function name prediction, in Table 6, we can see that when the model was trained without x64, there is only a very slight performance drop of 1% in the Cosine Similarity score when tested on x64. However, when the model was trained on ARM and x86, we see that for x86 there was a 4% drop in performance compared to ARM, even though x86 was still in the training data. Furthermore, for description, when the model was trained with All − x64, the performance on x64 only dropped by 2% in the Cosine Similarity score, and when the model was trained on ARM + x86 and tested with "All", we see almost no performance change. Furthermore, we also tested generalizability across the O0 and O3 optimization levels on the function name prediction and description tasks. For both tasks, the model was trained on O0 optimization and tested on the O3 optimization level. We see a mere 1% improvement when the model was trained and tested on the same optimization level. From this analysis, we can safely conclude that using different architectures has little to no effect on the function name prediction and description generation tasks.
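The text-similarity scores referenced throughout RQ2 and RQ3 (BLEU, Rouge-L, BERTScore, and cosine similarity) can be reproduced with standard packages. A minimal sketch, assuming the nltk, rouge-score, bert-score, and sentence-transformers libraries rather than the authors' own evaluation harness, and an arbitrary sentence encoder for the similarity score:

```python
from nltk.translate.bleu_score import sentence_bleu
from rouge_score import rouge_scorer
from bert_score import score as bertscore
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumption: any sentence encoder

def text_metrics(prediction: str, reference: str) -> dict:
    """Compute BLEU, Rouge-L, BERTScore F1, and cosine similarity for one pair."""
    bleu = sentence_bleu([reference.split()], prediction.split())
    rouge_l = rouge_scorer.RougeScorer(["rougeL"]).score(reference, prediction)["rougeL"].fmeasure
    _, _, f1 = bertscore([prediction], [reference], lang="en")
    cos = util.cos_sim(embedder.encode(prediction, convert_to_tensor=True),
                       embedder.encode(reference, convert_to_tensor=True)).item()
    return {"BLEU": bleu, "Rouge-L": rouge_l, "BERTScore-F1": f1.item(), "Sim": cos}
```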
To evaluate the generalizability of Large Language Models (LLMs) in real-world scenarios, we conducted a small-scale experiment using a generalized instruction-based dataset. Specifically, we tested Mistral and LLaMA 3 on the Stanford Alpaca dataset (Taori et al. 2023), performing inference on the base models prior to training with our dataset. Initial cosine similarity scores were 0.67 for Mistral and 0.73 for LLaMA 3. After training the models on our proposed dataset, we reassessed performance on the Stanford Alpaca dataset, observing that the cosine similarity scores for Mistral and LLaMA 3 dropped to 0.56 and 0.70, respectively. The notable decrease in Mistral's performance is likely due to its smaller model size (2B parameters), which led to catastrophic forgetting when trained on new data, whereas the 7B-parameter LLaMA 3 retained much of its prior learning. Additionally, we conducted an N-day vulnerability analysis, where LLaMA 3 and Mistral identified 15 and 6 N-day vulnerabilities, respectively.

Figure 3: RQ2: Comparison of classification on the combination of all architectures and individual architectures (per-CWE F1 scores for all architectures combined and for each individual architecture, with a 2-period moving average for the combined setting).

Table 5: RQ2: Function Name Prediction and Description when source code is compiled on different architectures.

Task         Arch.  BLEU  Rouge-L  BERTScore  Sim.
Func. Name   All    0.64  0.75     0.96       0.82
             x86    0.62  0.72     0.96       0.80
             ARM    0.67  0.75     0.96       0.81
             MIPS   0.68  0.77     0.96       0.82
             x64    0.61  0.71     0.95       0.78
Description  All    0.12  0.30     0.88       0.75
             x86    0.11  0.29     0.88       0.74
             ARM    0.11  0.30     0.88       0.76
             MIPS   0.12  0.31     0.88       0.76
             x64    0.13  0.30     0.88       0.76

Table 6: RQ3: Generalizability testing with different architectures and optimization levels on function name prediction and description generation.

Task           Train      Test  B.    R.L.  B.Score Pre.  B.Score F1  Sim
Function Name  All − x64  ARM   0.61  0.72  0.96          0.96        0.79
               All − x64  x64   0.61  0.71  0.96          0.96        0.78
               ARM + x86  All   0.62  0.70  0.96          0.96        0.77
               ARM + x86  x86   0.60  0.67  0.95          0.95        0.75
               ARM + x86  ARM   0.64  0.72  0.96          0.96        0.79
               O0         O0    0.30  0.36  0.89          0.89        0.44
               O0         O3    0.28  0.35  0.89          0.88        0.44
Description    All − x64  ARM   0.12  0.31  0.89          0.89        0.77
               All − x64  x64   0.12  0.29  0.88          0.88        0.75
               ARM + x86  All   0.11  0.29  0.89          0.88        0.75
               ARM + x86  x86   0.10  0.29  0.89          0.88        0.75
               O0         O0    0.10  0.25  0.89          0.89        0.70
               O0         O3    0.08  0.25  0.87          0.87        0.69

Related Work

Recent advances in binary vulnerability detection have focused on leveraging intermediate representations and deep learning techniques to address the challenges posed by code reuse. VulHawk (Luo et al. 2023) employed an intermediate representation-based approach using RoBERTa (Liu et al. 2019) to embed binary code and applied a progressive search strategy for identifying vulnerabilities in similar binaries. Asteria-Pro (Yang et al. 2023) utilized LSTMs (Hochreiter and Schmidhuber 1997) for large-scale binary similarity detection, while VulANalyzeR (Li et al. 2023a) proposed an attention-based method with Graph Convolution (Kipf and Welling 2017) and Control Flow Graphs to classify vulnerabilities and identify root causes. QueryX (Han et al. 2023) took a different approach by converting binaries into static source code through symbolic analysis and decompilation to detect bugs in commercial Windows kernels. In the realm of code summarization for decompiled binaries, Al-Kaswan et al. (2023) fine-tuned the CodeT5 model (Wang et al. 2021) on decompiled function-summary pairs, while HexT5 (Xiong et al. 2023) extended CodeT5 for tasks like code summarization and variable recovery. BinSum (Jin et al. 2023) introduced a binary code summarization dataset and evaluated LLMs such as GPT-4 (OpenAI 2023), Llama-2 (Touvron et al. 2023), and Code-Llama (Roziere et al. 2023) across various optimization levels and architectures. Additionally, Asm2Seq (Taviss et al. 2024) focused on generating textual summaries of assembly functions for vulnerability analysis, specifically targeting x86 assembly instructions.

Conclusion

In this study, we present a comprehensive investigation of large language models (LLMs) for the classification and identification of vulnerabilities in decompiled code and source code to determine the semantic gap. The primary contribution of our work is the development of the DeBinVul dataset, an extensive instruction-based resource tailored for vulnerability identification, classification, function name prediction, and description across four architectures and two optimization levels. Our experiments demonstrate that DeBinVul significantly improves vulnerability identification and classification by up to 30% compared to baseline models on the x86 architecture. Additionally, we provide an in-depth analysis of how different LLMs perform across various computer architectures for all four tasks. We also evaluated how our proposed dataset aids LLMs in generalizing across different architectures and optimization levels. Our results indicate that the LLMs maintained consistent performance even when exposed to new architectures or optimization methods not included in the training data.

References

Akamai. 2023. XZ Utils Backdoor — Everything You Need to Know, and What You Can Do. Accessed: 2024-05-19.
Al-Kaswan, A.; Ahmed, T.; Izadi, M.; Sawant, A. A.; Devanbu, P.; and van Deursen, A. 2023. Extending source code pre-trained language models to summarise decompiled binaries. In 2023 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER), 260–271. IEEE.

Balakrishnan, G.; and Reps, T. 2010. Wysinwyx: What you see is not what you execute. ACM Transactions on Programming Languages and Systems (TOPLAS), 32(6): 1–84.

Brunsfeld, M.; Thomson, P.; Vera, J.; Hlynskyi, A.; Turnbull, P.; Clem, T.; and Muller, A. 2018. tree-sitter/tree-sitter: v0.20.0.

Burk, K.; Pagani, F.; Kruegel, C.; and Vigna, G. 2022. Decomperson: How humans decompile and what we can learn from it. In 31st USENIX Security Symposium (USENIX Security 22), 2765–2782.

Eschweiler, S.; Yakdan, K.; Gerhards-Padilla, E.; et al. 2016. Discovre: Efficient cross-architecture identification of bugs in binary code. In NDSS, volume 52, 58–79.

Google. 2023. Gemini. Accessed: 2024-05-19.

Han, H.; Kyea, J.; Jin, Y.; Kang, J.; Pak, B.; and Yun, I. 2023. QueryX: Symbolic Query on Decompiled Code for Finding Bugs in COTS Binaries. In 2023 IEEE Symposium on Security and Privacy (SP), 3279–312795. IEEE.

Hochreiter, S.; and Schmidhuber, J. 1997. Long short-term memory. Neural Computation, 9(8): 1735–1780.

Hu, P.; Liang, R.; and Chen, K. 2024. DeGPT: Optimizing Decompiler Output with LLM. In Proceedings of the 2024 Network and Distributed System Security Symposium (NDSS).

Huang, J.-C.; and Leng, T. 1999. Generalized loop-unrolling: a method for program speedup. In Proceedings 1999 IEEE Symposium on Application-Specific Systems and Software Engineering and Technology (ASSET'99), 244–248. IEEE.

Jiang, A.; Sablayrolles, A.; Mensch, A.; Bamford, C.; Chaplot, D.; de las Casas, D.; Bressand, F.; Lengyel, G.; Lample, G.; Saulnier, L.; et al. 2023. Mistral 7B. arXiv preprint arXiv:2310.06825.

Jin, X.; Larson, J.; Yang, W.; and Lin, Z. 2023. Binary code summarization: Benchmarking ChatGPT/GPT-4 and other large language models. arXiv preprint arXiv:2312.09601.

Kipf, T. N.; and Welling, M. 2017. Semi-Supervised Classification with Graph Convolutional Networks. In International Conference on Learning Representations (ICLR).

Lee, B.; Song, C.; Jang, Y.; Wang, T.; Kim, T.; Lu, L.; and Lee, W. 2015. Preventing Use-after-free with Dangling Pointers Nullification. In NDSS.

Li, L.; Ding, S. H.; Tian, Y.; Fung, B. C.; Charland, P.; Ou, W.; Song, L.; and Chen, C. 2023a. VulANalyzeR: Explainable binary vulnerability detection with multi-task learning and attentional graph convolution. ACM Transactions on Privacy and Security, 26(3): 1–25.

Li, R.; Allal, L. B.; Zi, Y.; Muennighoff, N.; Kocetkov, D.; Mou, C.; Marone, M.; Akiki, C.; Li, J.; Chim, J.; et al. 2023b. StarCoder: may the source be with you! arXiv preprint arXiv:2305.06161.

Lin, C.-Y. 2004. Rouge: A package for automatic evaluation of summaries. In Text Summarization Branches Out, 74–81.

Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; and Stoyanov, V. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Luo, Z.; Wang, P.; Wang, B.; Tang, Y.; Xie, W.; Zhou, X.; Liu, D.; and Lu, K. 2023. VulHawk: Cross-architecture Vulnerability Detection with Entropy-based Binary Code Search. In NDSS.

Mantovani, A.; Compagna, L.; Shoshitaishvili, Y.; and Balzarotti, D. 2022. The Convergence of Source Code and Binary Vulnerability Discovery – A Case Study. In Proceedings of the 2022 ACM on Asia Conference on Computer and Communications Security, 602–615.

Nguyen, M.-D.; Bardin, S.; Bonichon, R.; Groz, R.; and Lemerre, M. 2020. Binary-level directed fuzzing for use-after-free vulnerabilities. In 23rd International Symposium on Research in Attacks, Intrusions and Defenses (RAID 2020), 47–62.

Nijkamp, E.; Hayashi, H.; Xiong, C.; Savarese, S.; and Zhou, Y. 2023. CodeGen2: Lessons for training LLMs on programming and natural languages. arXiv preprint arXiv:2305.02309.

NSA. 2019. Ghidra. https://ghidra-sre.org/. Software reverse engineering framework.

OpenAI. 2023. GPT-4. Accessed: 2024-05-19.

Papineni, K.; Roukos, S.; Ward, T.; and Zhu, W.-J. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 311–318.

Qualys. 2024. 2024 Midyear Threat Landscape Review. https://blog.qualys.com/vulnerabilities-threat-research/2024/08/06/2024-midyear-threat-landscape-review. Vulnerabilities Threat Research.

Roziere, B.; Gehring, J.; Gloeckle, F.; Sootla, S.; Gat, I.; Tan, X. E.; Adi, Y.; Liu, J.; Remez, T.; Rapin, J.; et al. 2023. Code Llama: Open foundation models for code. arXiv preprint arXiv:2308.12950.

Sarkar, V. 2000. Optimized unrolling of nested loops. In Proceedings of the 14th International Conference on Supercomputing, 153–166.

Su, Z.; Xu, X.; Huang, Z.; Zhang, K.; and Zhang, X. 2024. Source Code Foundation Models are Transferable Binary Analysis Knowledge Bases. arXiv preprint arXiv:2405.19581.

Taori, R.; Gulrajani, I.; Zhang, T.; Dubois, Y.; Li, X.; Guestrin, C.; Liang, P.; and Hashimoto, T. B. 2023. Stanford Alpaca: An Instruction-following LLaMA model. https://github.com/tatsu-lab/stanford_alpaca.

Taviss, S.; Ding, S. H.; Zulkernine, M.; Charland, P.; and Acharya, S. 2024. Asm2Seq: Explainable assembly code functional summary generation for reverse engineering and vulnerability analysis. Digital Threats: Research and Practice, 5(1): 1–25.
Touvron, H.; Martin, L.; Stone, K.; Albert, P.; Almahairi, A.; Babaei, Y.; Bashlykov, N.; Batra, S.; Bhargava, P.; Bhosale, S.; et al. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.

Wang, Y.; Wang, W.; Joty, S.; and Hoi, S. C. 2021. CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 8696–8708. Online and Punta Cana, Dominican Republic: Association for Computational Linguistics.

Xiong, J.; Chen, G.; Chen, K.; Gao, H.; Cheng, S.; and Zhang, W. 2023. HexT5: Unified Pre-Training for Stripped Binary Code Information Inference. In 2023 38th IEEE/ACM International Conference on Automated Software Engineering (ASE), 774–786. IEEE.

Yang, S.; Dong, C.; Xiao, Y.; Cheng, Y.; Shi, Z.; Li, Z.; and Sun, L. 2023. Asteria-Pro: Enhancing Deep Learning-based Binary Code Similarity Detection by Incorporating Domain Knowledge. ACM Transactions on Software Engineering and Methodology, 33(1): 1–40.

Zhang, T.; Kishore, V.; Wu, F.; Weinberger, K. Q.; and Artzi, Y. 2019. BERTScore: Evaluating text generation with BERT. arXiv preprint arXiv:1904.09675.

Appendix

In this appendix, we explain in detail the investigation and injection processes on different repositories and the dataset collection methodologies.

Source & Decompiled Binary Code Vulnerability Semantic Gap: Investigating LLMs' Analytical Abilities

In this section, we are the first to empirically and pragmatically investigate the analytical abilities of state-of-the-art LLMs in analyzing vulnerabilities in the decompiled binary code domain. To explore this, we randomly select 200 pairs of vulnerable and non-vulnerable source code and decompiled binary code samples from our proposed dataset, DeBinVul (please refer to the Proposed Approach section for details on the dataset), for the task of classifying vulnerabilities. Specifically, we evaluate the ability of several LLMs, including ChatGPT-4 (OpenAI 2023), Gemini (Google 2023), CodeLLaMA-7B (Roziere et al. 2023), Mistral (Jiang et al. 2023), LLaMa 3 (Touvron et al. 2023), and CodeGen2 (Nijkamp et al. 2023), to identify and classify vulnerabilities in both source code and decompiled binary code. Table 8 presents the comparison results and underscores the semantic vulnerability limitations of state-of-the-art black-box LLMs in classifying CWEs in decompiled binary code, in contrast to source code, which presented moderately better results. The reported results in Table 8 focus on CWEs 787, 416, 20, 125, 476, 190, 119, and 798, and report their average F1-scores.

A carefully designed prompt was used to leverage the generative capabilities of these LLMs, asking them to respond with a "Yes" or "No" regarding the presence of vulnerabilities. Additionally, the prompt required the LLMs to generate the corresponding CWE number if a vulnerability was detected. To classify the vulnerabilities, the prompt was adjusted to ensure that the LLMs only output the CWE category. The results from these LLMs are summarized and analyzed in Table 8. The specific prompts used in this analysis are provided in Appendix Prompts and Investigation.

Result Analysis. The results from Table 8 in the Appendix show that API-based models like GPT4 could accurately identify a vulnerability in decompiled binaries with an accuracy of 70%, whereas Gemini is at 56%. Moreover, open models like CodeLLaMa are at 61%, Mistral at 54%, LLaMa 3 at 50%, and CodeGen2 at 54%. Moreover, from Table 8, we see that GPT-4 performs comparatively higher overall than the other API-based or open-access models. To investigate the details of the identification task, we examine how accurately these LLMs can classify each vulnerability category. The results show that all models were strong at identifying some vulnerabilities and failed at others. For example, GPT4, Gemini, Mistral, and LLaMa 3 produce higher performance on CWE-416, CodeLLaMa on CWE-20, LLaMa 3 on CWE-190, and CodeGen2 on CWE-476. However, one interesting observation from our analysis is that while CodeLLaMa, Mistral, and CodeGen2 are moderately successful in identifying vulnerability, these LLMs fail to predict most CWEs. Therefore, we can conclude that CodeLLaMa has a very limited understanding of the vulnerability categories. Furthermore, Table 8 provides more detailed numerical results, including Accuracy, Precision, Recall, and F1 score. Section Reasoning of Weak Performance in Investigation in the Appendix explains the reasons behind the poor performance of LLMs when analyzing decompiled binaries.

Vulnerability Injection Process

This section shows the different prompts we used to investigate state-of-the-art LLMs and for the instruction generation tasks. Table 12 shows the three prompts we needed to generate the outcomes for the vulnerability identification and classification tasks. However, we noticed some disparity in how the models followed the same command. Initially, we created the prompt for GPT-4. However, when we used the same prompt for Gemini, we saw that it produced some extra output, namely the URLs of the CWEs. Therefore, we had to ground the behavior by updating and adding instructions to the prompt.

The other LLM Instruction columns refer to CodeLLaMa, Mistral, CodeGen2, and LLaMa 3. When generating the output, the model was generating the CWE description. However, this time, we were unsuccessful in grounding that behavior using extra instructions. Therefore, when processing the output generated by the model, we wrote extra code to remove the description CodeLLaMa was generating.
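The post-processing mentioned above (reducing a verbose model response to the bare verdict and CWE label) can be done with a simple pattern match. A minimal sketch, illustrative only and not the authors' exact code:

```python
import re

CWE_PATTERN = re.compile(r"CWE-\d+")
YES_NO_PATTERN = re.compile(r"\b(yes|no)\b", re.IGNORECASE)

def clean_model_output(raw: str) -> dict:
    """Reduce a verbose model response to the Yes/No verdict and the bare CWE id."""
    verdict = YES_NO_PATTERN.search(raw)
    cwe = CWE_PATTERN.search(raw)
    return {
        "vulnerable": verdict.group(1).capitalize() if verdict else None,
        "cwe": cwe.group(0) if cwe else None,  # drops any trailing description text
    }
```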
Figure 4: A high-level overview of our investigation process, from vulnerability injection to vulnerability analysis using decompiled code (stages: data collection and parsing of real-world functions from GitHub repositories, code injection of real-world CWEs, validation and compilation, decompilation to binary code using Ghidra, and investigation against LLM models).
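The code-injection stage of this pipeline, detailed in the paragraphs below, prompts GPT-4 to rewrite a selected function with a target CWE. A hedged sketch using the OpenAI Python SDK is shown here; the prompt wording and model name are illustrative and do not reproduce the actual prompts in Table 12.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def inject_vulnerability(function_src: str, cwe_id: str) -> str:
    """Ask GPT-4 to return the function rewritten with the requested weakness."""
    prompt = (
        f"Rewrite the following C function so that it contains a {cwe_id} weakness. "
        "Keep the signature unchanged, keep it compilable, and return only code.\n\n"
        f"{function_src}"
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```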

Table 7: Number of functions into which we injected 8 CWE vulnerabilities, across various top C/C++ general-purpose and IoT repositories.

Repository Name  Domain                Total
Linux Kernel     Operating System      3
Apache HTTPD     Web Server            7
OpenSSL          Security Library      2
FFmpeg           Multimedia Framework  6
cURL             Data Transfer         5
MQTT             IoT Protocol          2
Zigbee           IoT Networking        8
Node.js          Runtime Environment   3
SQLite           Database              61
Json-C           Data Format           77

We initially extracted all the functions from these ten repositories to investigate their effectiveness. Then, we randomly selected some functions into which to inject the vulnerabilities demonstrated in Table 8. After injecting the vulnerabilities and fixing the compilation errors, we compile each repository into its binaries and decompile them back to code using Ghidra (NSA 2019). As a result, the decompiled versions of the functions into which we injected vulnerabilities are also vulnerable. Our analysis includes identification, classification, and function name prediction on the decompiled code. Table 7 summarizes the code we generated to investigate vulnerability across different LLMs.

Compiling an open-source repository is challenging because it requires many software and hardware dependencies to run appropriately and be compiled into the binaries listed in the makefile. We explored ten popular C repositories from GitHub, listed in Table 7. Functions from these repositories were used to generate an adversarial attack on code.

After getting the repositories, we extracted the function name from each source code function by parsing the function definition with Tree-Sitter (Brunsfeld et al. 2018) and using an S-expression to extract the function name.

We randomly selected 200 functions from these repositories and injected vulnerabilities using GPT-4. Each function was appended with instructions on how to inject the vulnerability. However, some injected vulnerable code had compilation errors; therefore, we had to remove some of them, leaving a total of 138 samples. Furthermore, we use the original non-injected functions as non-vulnerable samples in our adversarial dataset, totaling 276 decompiled code samples. Then, we compiled both repositories with the injected vulnerable functions using the GCC compiler on a DGX A100 server with an x86_64 Linux operating system. Then, we used Ghidra (NSA 2019) to decompile the binaries into decompiled code.

Although we provided GPT-4 with strict instructions on injecting vulnerabilities without creating potential errors, GPT-4 occasionally introduced compiler errors that would prevent the build of the vulnerable repository. Some of these compiler errors included accessing fictitious fields of structures, calling functions that did not exist, and minor syntax errors. Initially, we randomly picked 200 code samples into which to inject vulnerabilities. However, out of the 175 samples, 62 samples were not compilable, which we ignored. Table 11 shows the total number of decompiled vulnerable functions per CWE category.

Reasoning of Weak Performance in Investigation

Reasoning on Poor Performance of LLMs

Reverse engineers face many challenges when analyzing decompiled code. Understanding the objective and semantics of decompiled code is generally more complex than understanding the semantics of source code. During the decompilation process, the variables or the flow of statements sometimes get changed for optimization. As a result, working with decompiled code for tasks such as vulnerability analysis or decompiled code summarization (Al-Kaswan et al. 2023) is more challenging. Some of the primary reasons for poor performance are directly related to the removal of comments during compilation, function names being replaced by memory addresses, complex control flow, and obfuscation of variable names.

C1: Comments Do Not Exist. When source code is compiled, the compiler typically ignores comments, and they no longer exist in the compiled binary. Therefore, comments are irrecoverable in decompiled code. Without comments, comprehending decompiled code is incredibly challenging and time-consuming, as it provides limited information about the code's purpose and intended behavior. Therefore, the reverse engineer has to derive meaning from syntax and structure.
Table 8: Performance of API-based and open LLMs on the vulnerability identification and classification tasks in decompiled binaries.

CWE Number GPT-4 Gemini CodeLLaMa


Acc. Pre. Rec. F1 Acc. Pre. Rec. F1 Acc Pre. Rec. F1
Identification 0.70 0.80 0.70 0.67 0.56 0.57 0.56 0.54 0.61 0.78 0.61 0.54
CWE-787 0.52 0.5 0.26 0.34 0.23 0.5 0.11 0.19 0.0 0.0 0.0 0.0
CWE-416 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0
CWE-20 0.29 0.5 0.14 0.22 0.1 0.5 0.5 0.09 0.94 0.5 0.47 0.48
CWE-125 0.73 0.5 0.36 0.42 0.07 0.5 0.04 0.07 0.0 0.0 0.0 0.0
CWE-476 1.0 1.0 1.0 1.0 0.58 0.5 0.29 0.36 0.0 0.0 0.0 0.0
CWE-190 0.84 0.5 0.42 0.45 0.5 0.5 0.25 0.33 0.0 0.0 0.0 0.0
CWE-119 0.42 0.5 0.21 0.29 0.36 0.5 0.18 0.26 0.0 0.0 0.0 0.0
CWE-798 0.75 0.5 0.375 0.42 0.4 0.5 0.2 0.29 0.0 0.0 0.0 0.0
Mistral LLaMa 3 CodeGen 2
Acc. Pre. Rec. F1 Acc. Pre. Rec. F1 Acc. Pre. Rec. F1
Identification 0.54 0.51 0.52 0.54 0.5 0.25 0.5 0.33 0.54 0.60 0.51 0.52
CWE-787 0.0 0.0 0.0 0.0 0.84 0.5 0.47 0.48 0.0 0.0 0.0 0.0
CWE-416 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0
CWE-20 0.0 0.0 0.0 0.0 0.94 0.5 0.47 0.48 0.0 0.0 0.0 0.0
CWE-125 0.0 0.0 0.0 0.0 0.8 0.5 0.4 0.44 0.0 0.0 0.0 0.0
CWE-476 0.0 0.0 0.0 0.0 0.82 0.5 0.41 0.45 1.0 1.0 1.0 1.0
CWE-190 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0
CWE-119 0.41 0.32 0.57 0.35 0.90 0.50 0.44 0.47 0.0 0.0 0.0 0.0
CWE-798 0.0 0.0 0.0 0.0 0.93 0.50 0.46 0.47 0.0 0.0 0.0 0.0

C2: Ambiguous Function Calls. When source code is compiled, the compiler may optimize the code by replacing standard function calls, such as strcpy, with a custom function that performs the same task more efficiently. Alternatively, the compiler may optimize the binary by inlining the instructions of commonly called functions into the calling function, effectively removing the function call altogether. This may prove challenging for reverse engineers attempting to understand code semantics through commonly called functions. This ambiguity also arises during decompilation, because the disassembler does not know the original function name, and the replacement string, such as 0x818042DOE, is simply the function's address in system memory.

C3: Complex Control Flow. The control flow of the source code may be optimized and modified in the compiled binary. As a result, the decompiled code may have a more complex control flow of statements, which is difficult to understand from a reverse engineer's perspective. A common example is loop unrolling (Sarkar 2000; Huang and Leng 1999), where a loop is unraveled into a sequence of instructions instead of jumping back to the beginning of the loop until a condition is met. Such optimizations can be unusual and confusing when comprehending the flow of decompiled code.

DeBinVul Dataset Preparation

In this section, we describe in detail how we extract each component of the dataset.

Function Extraction. To extract all function definitions from a file, we used the S-expressions of Tree-Sitter. In Tree-Sitter, an S-expression (symbolic expression) is a way to represent the Abstract Syntax Tree (AST) of source code in a nested, parenthetical format. Each node in the tree is represented as a list, starting with the node type followed by its children, which can be terminal tokens or further nested lists. This format provides a clear and concise textual representation of the code's syntactic structure. To extract the functions, we used the following S-expression: (function_definition) @func-def.

After extracting the functions with this S-expression, our next task is to separate vulnerable from non-vulnerable functions. We found a specific pattern that makes this task straightforward: all function definitions extracted from the file contain either "good" (a benign entry) or "bad" (a vulnerable entry) in the function's name. For each extracted function definition, we used another S-expression to extract the function name: (function_declarator (identifier) @func_name). Complete definitions of these S-expressions are available in the repository we provided earlier.

After extracting the functions and function names, our next task is to classify the functions. This part is relatively straightforward. If the function name contains the substring "good," we consider it a benign, or non-vulnerable, function; if it contains the substring "bad," we consider the function vulnerable. The extracted function names follow the format CWEXXX_rest_of_function_name, so we take the first part of the name (CWEXXX) to capture the CWE number associated with the vulnerable code. Table 11 shows the total number of decompiled functions we generated for each CWE.
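To make the extraction and labeling concrete, the following is a minimal sketch using the tree_sitter Python bindings. The grammar path, version-specific API details, and helper names are assumptions, and the complete queries live in the repository mentioned above.

import re
from tree_sitter import Language, Parser

# Assumes a C grammar has been built into a shared library; the loading and
# query APIs differ slightly across tree_sitter versions.
C_LANGUAGE = Language("build/languages.so", "c")
parser = Parser()
parser.set_language(C_LANGUAGE)

FUNC_QUERY = C_LANGUAGE.query("(function_definition) @func-def")
NAME_QUERY = C_LANGUAGE.query("(function_declarator (identifier) @func_name)")

def extract_labeled_functions(source: bytes):
    tree = parser.parse(source)
    samples = []
    # captures() yields (node, capture_name) pairs in older tree_sitter releases.
    for node, _ in FUNC_QUERY.captures(tree.root_node):
        code = source[node.start_byte:node.end_byte].decode("utf8", "replace")
        names = NAME_QUERY.captures(node)
        if not names:
            continue
        name_node = names[0][0]
        name = source[name_node.start_byte:name_node.end_byte].decode("utf8")
        if "good" in name:
            label = "non-vulnerable"
        elif "bad" in name:
            label = "vulnerable"
        else:
            continue  # skip helper functions that are neither good nor bad
        match = re.match(r"CWE\d+", name)          # CWEXXX prefix of the name
        cwe = match.group(0) if match else None
        samples.append({"name": name, "label": label, "cwe": cwe, "code": code})
    return samples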
Compilation and Optimization. Our compilation process strategically employs the -O0 and -O3 optimization levels to assess the impact of compiler optimizations on the security and functionality of executables. By selecting these two extreme levels, we can thoroughly evaluate how optimizations influence executable behavior in ways that the intermediate levels, -O1 and -O2, may not fully reveal. The -O0 level, which applies no optimization, ensures that the compiled code remains a straightforward representation of the source code. This direct correspondence is critical for accurately tracing vulnerabilities back to their source, providing a clear baseline for understanding the application's intended behavior. In contrast, the -O3 level introduces aggressive optimizations such as function inlining, loop unrolling, and advanced vectorization. These can enhance performance and efficiency, and they can also introduce or expose vulnerabilities, such as buffer overflows, specifically related to these optimization techniques. Moreover, -O3 mimics the high-performance conditions often found in production environments, making it invaluable for simulating real-world application scenarios. This dual approach, employing both -O0 and -O3, allows us to capture a comprehensive range of effects, from no optimization to maximum optimization, thereby providing a broad-spectrum analysis of how different optimization levels can affect an executable's performance, size, and, crucially, its security properties. This method ensures we identify any vulnerabilities introduced or masked by compiler optimizations, offering a robust evaluation that intermediate optimization levels might overlook.

We utilized the following compiler commands: gcc -O0 (x86), gcc -O3 (x86), clang -O0 (x86), clang -O3 (x86), aarch64-linux-gnu-gcc -O0 (ARM), and aarch64-linux-gnu-gcc -O3 (ARM). The -DINCLUDEMAIN option was included to define the main function necessary for compiling the CWE code. We compiled the source code twice for each compiler command, using the -DOMITGOOD and -DOMITBAD options to generate the vulnerable and benign executables, respectively. This systematic approach ensured that we could thoroughly examine the impact of different compilers, optimization levels, and code variants on the security properties of the executables.

In summary, our study assesses the impact of compiler optimizations on security and functionality by using the -O0 and -O3 optimization levels. The -O0 level, with no optimizations, provides a direct correspondence to the source code, which is essential for tracing vulnerabilities accurately. Conversely, the -O3 level applies aggressive optimizations that can enhance performance but also introduce or expose vulnerabilities, simulating high-performance production environments. This dual approach allows us to capture a wide range of effects, providing a thorough analysis of how different optimization levels influence executable behavior. By comparing the extremes of no optimization and maximum optimization, we ensure a robust evaluation that intermediate levels might miss.
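A minimal sketch of this compilation sweep is shown below. The output naming, error handling, and exact set of toolchains are placeholder assumptions; the actual builds are driven by the CWE samples' own build files.

import itertools
import subprocess

COMPILERS = ["gcc", "clang", "aarch64-linux-gnu-gcc"]   # x86 and ARM toolchains from above
OPT_LEVELS = ["-O0", "-O3"]
VARIANTS = {"vulnerable": "-DOMITGOOD", "benign": "-DOMITBAD"}

def build_all(source_file: str) -> None:
    for compiler, opt in itertools.product(COMPILERS, OPT_LEVELS):
        for label, omit_flag in VARIANTS.items():
            output = f"{source_file}.{compiler}.{opt.lstrip('-')}.{label}"
            cmd = [compiler, opt, "-DINCLUDEMAIN", omit_flag, source_file, "-o", output]
            # Some generated samples do not compile; those are dropped later in post-processing.
            subprocess.run(cmd, check=False)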
Decompilation. For the binaries that we decompile, we additionally build with the -s flag. The -s flag in GCC instructs the compiler to strip symbol information, including function names, from the resulting executable. This significantly reduces the executable's size but also hinders debugging and reverse engineering efforts. Stripped binaries can be more challenging to analyze and understand, potentially making it more difficult for attackers to exploit vulnerabilities. However, it is essential to note that while -s removes human-readable function names, it does not obscure the underlying code logic or prevent advanced reverse engineering techniques.
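A hedged sketch of the strip-and-decompile step follows. The project directory, the Ghidra installation path, and the export post-script are assumptions: Ghidra's headless launcher and its -import/-postScript options are standard, but export_decompiled.py is a hypothetical script, not one shipped with Ghidra or this project.

import subprocess

def strip_and_decompile(source_file: str, binary_path: str) -> None:
    # Build with -s so symbol information (including function names) is stripped.
    subprocess.run(
        ["gcc", "-O0", "-s", "-DINCLUDEMAIN", source_file, "-o", binary_path],
        check=True,
    )
    # Headless Ghidra import and analysis; export_decompiled.py is a hypothetical
    # post-script that would write the decompiled C of each function to disk.
    subprocess.run(
        [
            "/opt/ghidra/support/analyzeHeadless", "/tmp/ghidra_projects", "debinvul",
            "-import", binary_path,
            "-scriptPath", "./ghidra_scripts",
            "-postScript", "export_decompiled.py",
            "-deleteProject",
        ],
        check=True,
    )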
Instruction Generation. We created 20 instructions for each task, totaling 80 instructions across the four decompiled binary analysis tasks. We used four specially curated prompts with GPT-4 to automatically generate these 80 instructions.
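As a rough illustration of this step, one of the curated prompts (the exact wording appears in Table 14) could be sent to GPT-4 as sketched below; the model identifier, API usage, and output parsing are assumptions, not a record of the exact calls made.

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SEED_PROMPT = (
    "Consider yourself as a Code Security Reverse Engineer Specialist. "
    "Task: generate a prompt to ask a large language model to identify "
    "vulnerabilities in decompiled code. "
    "Now, create 20 different variants of these prompts."
    # abbreviated; the full curated prompts are listed in Table 14
)

def generate_instruction_variants() -> list[str]:
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": SEED_PROMPT}],
    )
    # One variant per line is assumed; the real output may need light parsing.
    text = response.choices[0].message.content
    return [line.strip() for line in text.splitlines() if line.strip()]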
Post-processing. Following the extraction, compilation, and decompilation of functions using Ghidra, several post-processing steps were undertaken to ensure the dataset's quality and suitability for vulnerability analysis. First, instances of gibberish code that would not compile were identified and removed to prevent skewed results. Additionally, empty methods containing only method names without executable code were eliminated, as they provided no value to the analysis. We also encountered numerous code samples that were either identical or too similar, differing only in variable names (e.g., 'srini string' versus 'string srini'); these redundancies were systematically removed to maintain dataset diversity and prevent bias. Furthermore, many CWE categories lacked viable code examples due to insufficient training data in the original CWE dataset. To address this, we employed synthetic code generation and semi-supervised learning techniques to augment the dataset, thereby increasing the representation of underrepresented CWEs. These post-processing steps were crucial in refining the dataset, ensuring it was robust, diverse, and ready for accurate vulnerability analysis. Table 10 briefly overviews our proposed dataset.
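The near-duplicate filter can be sketched as follows; the identifier-masking tokenizer below is a rough regex-based stand-in, not the project's actual normalization, and the keyword list is abbreviated.

import re

_IDENT = re.compile(r"\b[A-Za-z_][A-Za-z0-9_]*\b")
_WS = re.compile(r"\s+")

C_KEYWORDS = {"if", "else", "for", "while", "return", "int", "char", "void",
              "unsigned", "sizeof", "struct", "static", "const"}  # abbreviated

def canonical_form(code: str) -> str:
    # Mask every non-keyword identifier so that functions differing only in
    # variable names map to the same key.
    def mask(match):
        token = match.group(0)
        return token if token in C_KEYWORDS else "ID"
    return _WS.sub(" ", _IDENT.sub(mask, code)).strip()

def deduplicate(samples):
    seen, kept = set(), []
    for sample in samples:
        key = canonical_form(sample["code"])
        if key not in seen:
            seen.add(key)
            kept.append(sample)
    return kept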
Table 9: A brief description of the six different LLMs we tested to analyze their performance on code vulnerability analysis.

Model        Src.         Par.   Modality    Access
GPT-4        OpenAI       1.7T   Text/Code   API
Gemini       Google       ~      Text/Code   API
CodeLLaMa    Meta AI      7B     Code        Open
CodeGen2     Salesforce   7B     Code        Open
LLaMa 3      Meta AI      8B     Text/Code   Open
Mistral      Mistral AI   7B     Text/Code   Open

Evaluation Metrics

BLEU Score. The BLEU (Papineni et al. 2002) score is a syntax-based way of evaluating machine-generated text, producing a score between 0 and 1. It is a reference-based metric and may not capture all aspects of translation quality, such as fluency or semantic accuracy.
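As a small illustration of how such a score could be computed over generated descriptions, the sketch below uses the sacrebleu package; this is one common choice rather than necessarily the exact scorer used in this work, and the 0-1 rescaling is applied to match the convention above.

import sacrebleu

def bleu_score(generated: list[str], references: list[str]) -> float:
    # sacrebleu expects a list of hypotheses and a list of reference streams;
    # its score is on a 0-100 scale, so we rescale it to 0-1.
    return sacrebleu.corpus_bleu(generated, [references]).score / 100.0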
Table 10: Binary Functions across four Computer Architectures and two different Optimization Levels. Our proposed dataset DeBinVul contains 150,872 binary functions over 40 CWE classes.

Architecture  Optimization  Vulnerable                                        Non-vulnerable                                     Total
                            Vul. Func.   Avg. Token/Line   Avg. Token in      Non-Vul. Func.   Avg. Token/Line   Avg. Token in
                                                            Descs.                                                Descs.
x86           O0            11924        73/31             167                23743            66/29             96               35667
x86           O3            11889        143/52            167                10345            151/45            95               22234
x64           O0            5956         89/35             167                11867            85/34             96               17823
x64           O3            5962         157/53            167                7210             189/54            97               13172
ARM           O0            5962         83/35             167                11854            74/32             96               17816
ARM           O3            5960         123/48            167                7210             163/52            97               13170
MIPS          O0            5960         81/32             167                11866            73/30             96               17826
MIPS          O3            5959         132/45            167                7205             204/57            97               13164

40 different categories of CWEs, totaling 150,872 functions.
Table 11: CWE Count Per Class of Decompiled Binary Code

CWE Number   Count     CWE Number   Count
CWE-127      3344      CWE-666      799
CWE-590      3330      CWE-510      699
CWE-124      3340      CWE-426      660
CWE-121      3340      CWE-467      479
CWE-401      3339      CWE-415      370
CWE-122      3339      CWE-506      340
CWE-761      3339      CWE-475      320
CWE-690      3339      CWE-319      319
CWE-126      3338      CWE-464      300
CWE-427      3338      CWE-459      260
CWE-78       3338      CWE-773      230
CWE-789      3338      CWE-476      180
CWE-606      3338      CWE-605      180
CWE-680      2797      CWE-469      160
CWE-665      1696      CWE-675      140
CWE-758      1677      CWE-404      100
CWE-123      1349      CWE-775      70
CWE-617      1087      CWE-681      70
CWE-457      980       CWE-688      40
CWE-416      830       CWE-685      30
Rouge-L. Similar to BLEU, the Rouge-L (Lin 2004) score is also a number between 0 and 1 that measures the syntax-based similarity of two generated texts. It generates a score by quantifying precision and recall over the longest common subsequence (LCS) between the generated and reference codes.

BERTScore. Furthermore, we use BERTScore (Zhang et al. 2019) for semantic comparison, using a cosine similarity score to identify how well the generated tokens match the ground truth tokens. BERTScore generates an embedding vector for each generated token, computes a cosine similarity with all the ground truth tokens, and averages the result to produce the final score. Because it compares token embeddings rather than exact tokens, it permits a soft similarity measure instead of exact matching, since secure code can be written in various ways.

Cosine Similarity. BLEU, Rouge-L, and BERTScore compare texts through n-gram-based exact token matching or embedding-based token matching. However, a description can be written differently while its meaning remains intact. Thus, we also measure the semantic similarity between the embedding of the LLM-generated text and the ground truth. Formally, given the sequence of tokens in the LLM-generated description D = {w_1, w_2, ..., w_n} and the ground-truth reference description D̂ = {ŵ_1, ŵ_2, ..., ŵ_m}, we use a sentence encoder E to generate the embeddings E(D) = {e_{w_1}, e_{w_2}, ..., e_{w_n}} and E(D̂) = {e_{ŵ_1}, e_{ŵ_2}, ..., e_{ŵ_m}}. The overall semantics of D and D̂ are then represented by

    e_D = \frac{1}{|D|} \sum_{i=1}^{n} e_{w_i}, \qquad e_{\hat{D}} = \frac{1}{|\hat{D}|} \sum_{j=1}^{m} e_{\hat{w}_j}    (1)

Therefore, we calculate the similarity score as

    \mathrm{sim}(D, \hat{D}) = \frac{e_D \cdot e_{\hat{D}}^{T}}{\lVert e_D \rVert \cdot \lVert e_{\hat{D}} \rVert}    (2)
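A direct implementation of Equations (1) and (2) is sketched below; the choice of sentence encoder E is left abstract here, since the sketch only assumes that it returns one embedding vector per token.

import numpy as np

def description_embedding(token_embeddings: np.ndarray) -> np.ndarray:
    # token_embeddings has shape (num_tokens, dim); averaging the e_{w_i}
    # vectors yields the description-level embedding e_D of Equation (1).
    return token_embeddings.mean(axis=0)

def cosine_similarity(e_d: np.ndarray, e_d_hat: np.ndarray) -> float:
    # Equation (2): sim(D, D_hat) = (e_D . e_D_hat) / (||e_D|| * ||e_D_hat||)
    return float(np.dot(e_d, e_d_hat) / (np.linalg.norm(e_d) * np.linalg.norm(e_d_hat)))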
Further Discussion on RQ1.

For the classification task, from Table 13, we can see that CodeGen2 and StarCoder show the highest performance across most CWEs, with F1 scores over 90% for CWE-590, CWE-121, and CWE-122. Mistral also shows similar performance across most other CWEs, whereas CodeLLaMa shows relatively poorer performance. We also notice that for some CWEs, such as CWE-121, CodeLLaMa and LLaMa 3 achieve 100% accuracy. Moreover, Mistral (Jiang et al. 2023) shows 100% accuracy for CWE-789 and CWE-123, which further elucidates that some models have a slight bias in classifying different CWEs.
Table 12: The instructions we used to investigate LLMs to determine their performance on decompiled code vulnerability identification and classification tasks.

GPT-4 Instruction:
You are an expert at analyzing software security vulnerabilities in static source code and decompiled binaries.
Task Description:
Task 1: Your task is to detect if a vulnerability exists in the following code. Answer with "Yes/No" only if a vulnerability exists.
Task 2: If your answer is "Yes" to detecting vulnerability, print only the CWE number of the vulnerability. Do not produce any other output more than you are asked. Do not use any context from any previous input or output.
Code:

Gemini Instruction:
You are an expert at analyzing software security vulnerabilities in static source code and decompiled binaries.
Task Description:
Task 1: Your task is to detect if a vulnerability exists in the following code. Answer with "Yes/No" only if a vulnerability exists.
Task 2: If your answer is "Yes" to detecting vulnerability, print only the CWE number of the vulnerability. Do not produce any other output more than you are asked. Do not use any context from any previous input or output. Answer only in the following format; do not describe the CWE number and no extra link: "Yes/No CWE-XXX"
Code:

Other LLM Instruction:
You are an expert at analyzing software security vulnerabilities in static source code and decompiled binaries.
Task Description:
Task 1: Your task is to detect if a vulnerability exists in the following code. Answer with "Yes/No" only if vulnerability exists.
Task 2: If your answer is "Yes" to detecting vulnerability, print only the CWE number of the vulnerability. Do not produce any other output more than you are asked. Do not use any context from any previous input or output. Answer only in the following format, no extra link: "Yes/No CWE-XXX"
Code:
We use the same models trained for decompiled code function name prediction and function description generation. From Table 4, we see that LLaMa 3 has the highest function name prediction performance across all models, with a BERTScore F1 of 0.97 and a cosine similarity score of 0.83. For the description generation task, CodeLLaMa, Mistral, and LLaMa 3 perform better than all other models, with Mistral and LLaMa 3 showing an improvement of 7% over StarCoder and merely a 1% improvement over CodeLLaMa and CodeGen2.
Table 13: RQ1: Comparison of Classification on 20 CWEs against various LLMs
CWE
Model Metric 401 690 124 761 590 121 789 122 606 127
Pre. 0.83 0.9 0.94 0.97 0.93 1 0.68 0.59 0.71 0.93
CodeLLaMa Rec 0.51 0.79 1 0.9 0.9 1 0.48 0.96 0.52 0.61
F1 0.63 0.84 0.97 0.93 0.91 1 0.57 0.73 0.6 0.74
Pre. 0.98 0.98 0.81 1 0.86 0.86 0.92 1 0.79 0.59
CodeGen2 Rec 1 1 0.62 1 0.97 1 1 0.91 1 0.45
F1 0.99 0.99 0.7 1 0.91 0.93 0.96 0.95 0.88 0.51
Pre. 0.84 0.89 0.94 0.97 1 0.91 0.75 0.74 0.73 1
Mistral Rec 0.57 0.97 1 1 0.83 1 0.44 0.96 0.48 0.39
F1 0.68 0.93 0.97 0.98 0.91 0.95 0.56 0.84 0.58 0.56
Pre. 0.9 0.94 0.97 0.89 1 1 0.76 0.69 0.83 0.93
LLaMa 3 Rec 0.76 1 1 1 0.9 1 0.81 1 0.83 0.61
F1 0.82 0.97 0.99 0.94 0.95 1 0.79 0.82 0.83 0.74
Pre. 0.78 0.69 0.87 0.94 1 0.97 0.79 0.93 0.77 0.92
StarCoder Rec 0.57 0.85 1 1 0.9 1 0.41 1 0.74 0.52
F1 0.66 0.76 0.93 0.97 0.95 0.98 0.54 0.96 0.76 0.67
122 121 789 680 758 416 665 457 123 617
Pre. 0.9 0.51 1 0.78 0.94 1 0.62 0.73 1 0.75
CodeLLaMa Rec 0.86 1 0.81 0.9 0.89 0.87 0.71 0.92 0.73 0.9
F1 0.88 0.68 0.89 0.84 0.91 0.93 0.67 0.81 0.84 0.82
Pre. 0.95 0.95 0.94 1 0.92 0.76 0.7 0.8 1 0.57
CodeGen2 Rec 0.86 0.9 0.83 0.88 0.71 1 0.93 0.92 0.83 0.44
F1 0.9 0.92 0.88 0.94 0.8 0.86 0.8 0.86 0.91 0.5
Pre. 0.83 0.49 1 0.83 0.94 0.82 0.68 0.79 1 0.89
Mistral Rec 0.91 1 1 0.95 0.94 0.93 0.93 0.92 1 0.8
F1 0.87 0.66 1 0.88 0.94 0.88 0.79 0.85 1 0.84
Pre. 0.91 0.84 0.95 0.95 0.95 1 0.87 0.92 1 0.82
LLaMa 3 Rec 0.91 1 1 0.9 1 0.93 0.93 0.92 1 0.9
F1 0.91 0.91 0.98 0.92 0.97 0.97 0.9 0.92 1 0.86
Pre. 0.87 0.5 0.78 0.95 0.93 1 0.86 0.73 1 1
StarCoder Rec 0.91 1 1 1 0.78 0.93 0.86 0.92 0.82 0.8
F1 0.89 0.67 0.88 0.98 0.85 0.97 0.86 0.81 0.9 0.89
Table 14: Prompts used as an input for GPT-4 to generate 20 Instructions for each task, namely, i) Identification, ii) Classifica-
tion, iii) Function Name Prediction and iv) Description
Task Prompt
Consider yourself as a Code Security Reverse engineer Specialist.
Task: You are to generate a prompt to ask a large language model to Identify Vulnerabilities in decompiled
code. Make sure that the prompt ensures the answer is in ”Yes/No” format and that no extra content is
generated by the LLM when using the prompt.
Now, create 20 different variants of these prompts.
Here are two sample examples:
Ident. 1. ”You are a Reverse Engineer of Code. Detect the presence of events of vulnerabilities that exist in the
given code.
If a vulnerability exists, answer ”YES”, otherwise, answer ”NO”. Do not produce any extra outputs.
2. ”As an event Vulnerability expert of decompiled code, analyze the code to find whether any
vulnerability exists in the following code.
If you think the code has vulnerability, produce a single word ”YES”, and if vulnerability does not exist,
reply ”NO” only. Do not output any extra words or URLs.
Consider yourself as a Code Security Reverse engineer Specialist.
Task: You are to generate a prompt to ask a large language model to Classify Vulnerabilities in decompiled
code. The classification is based on the CWE numbers. Make sure that the prompt ensures the answer is in
”CWE-XXX”
format and that no extra content is generated by the LLM when using the prompt. Here ”XXX” refers to
the vulnerability number.
Now, create 20 different variants of these prompts.
Here are two sample examples:
Class. 1. ”You are a Reverse Software Engineer of Decompiled Code. Classify the presence of vulnerabilities
that exist
in the given code.
Generate the vulnerability classification in the following format only ”CWE-XXX”. Do not produce
any extra outputs.
2. ”As an expert classifying Vulnerability in decompiled code, analyze the code to find the
vulnerability category existing in the following code. Print only the CWE number of the vulnerability
in this format ”CWE-XXX”.
Do not output any extra words or URLs.
Consider yourself as a Code Security Reverse engineer Specialist who can predict function names on
decompiled binary code.
Task: You are to generate a prompt to ask a large language model to predict function name in decompiled code.
The function name can only have the characters supported in C/C++ programming language, and the outcome
would only consist of a single word, and no extra content is generated by the LLM when using the prompt.
Now, create 20 different variants of these prompts.
Here are two sample examples:
Func. Name Pred. 1. ”You are a Reverse Software Engineer of Decompiled Code. Predict the name of the decompiled code.
Generate the function name only in a single word. You can use camelCasing or Snake Casing.
Do not produce any extra outputs.
2. ”As an expert classifying Vulnerability in decompiled code, analyze the code to determine
function name in the following code.
Print only the Function name in a single word. You are allowed to use snake casing or camelCasing
to generate the function names. Do not output any extra words or URLs.
Consider yourself as a Code Security Reverse engineer Specialist who can generate the description of a
decompiled code.
Task: You are to generate a prompt to ask a large language model to Describe the objective
and/or Vulnerabilities in decompiled code. The description should explain the flow of the code but not
specify any variable or function names. The descriptions should be generic enough to be used in a
decompiled code as well.
Now, create 20 different variants of these prompts.
Here are two sample examples:
Desc.
1. ”You are a Reverse Software Engineer of Decompiled Code. Describe the objective and the security
issues that exist in the given code. Make sure you generate a generic description of the vulnerability
without specifying function or variable names. Please ensure that the generated description
can be used in decompiled code.
2. ”As an expert explaining the objectives and vulnerability in decompiled code, analyze the code
to explain the objective and vulnerability. Ensure the generalizability of the description by not
mentioning the function and the variable names as the description will be used in a decompiled
code where the variable and function names are obfuscated.”
Table 15: Average F1 Score Comparison on Base vs. Trained LLMs for Classification Task.
             CodeLLaMa          CodeGen2           Mistral            LLaMa 3            StarCoder
CWE          Base    Trained    Base    Trained    Base    Trained    Base    Trained    Base    Trained
CWE-124 0 0.63 0 0.99 0 0.68 0 0.82 0 0.66
CWE-427 0 0.84 0 0.99 0 0.93 0 0.97 0 0.76
CWE-401 0.11 0.97 0 0.7 0 0.97 0 0.99 0.11 0.93
CWE-761 0.12 0.93 0 1 0 0.98 0 0.94 0 0.97
CWE-590 0 0.91 0 0.91 0 0.91 0 0.95 0 0.95
CWE-690 0.13 1 0 0.93 0 0.95 0 1 0.13 0.98
CWE-127 0 0.57 0 0.96 0 0.56 0 0.79 0 0.54
CWE-606 0 0.73 0 0.95 0 0.84 0 0.82 0 0.96
CWE-126 0 0.6 0 0.88 0 0.58 0 0.83 0 0.76
CWE-78 0.16 0.74 0 0.51 0 0.56 0 0.74 0.16 0.67
CWE-122 0.09 0.88 0 0.9 0 0.87 0 0.91 0.08 0.89
CWE-121 0.07 0.68 0 0.92 0 0.66 0 0.91 0.08 0.67
CWE-789 0.09 0.89 0 0.88 0 1 0 0.98 0 0.88
CWE-680 0 0.84 0 0.94 0 0.88 0 0.92 0.1 0.98
CWE-758 0 0.91 0 0.8 0 0.94 0 0.97 0 0.85
CWE-416 0 0.93 0 0.86 0 0.88 0 0.97 0 0.97
CWE-665 0 0.67 0 0.8 0 0.79 0 0.9 0 0.86
CWE-457 0 0.81 0 0.86 0 0.85 0 0.92 0 0.81
CWE-123 0 0.84 0 0.91 0 1 0 1 0 0.9
CWE-617 0 0.82 0 0.5 0 0.84 0 0.86 0 0.89
Average 0.0385 0.8095 0 0.8595 0 0.8335 0 0.9095 0.033 0.844