Figure 1: Our Proposed Approach: An overview of our proposed instruct dataset DeBinVul with a sample example comprising a decompiled binary code input and a list of questions (instructions) and answers. Subsequently, using DeBinVul, we train state-of-the-art LLMs to optimize them and elevate their capabilities in assisting reverse engineers in unveiling vulnerabilities in binary code.
we use decompiled functions. Therefore, we ensure that function and variable names are not present when describing the function objectives and vulnerabilities.

Step 3: Instructions
We provide an instruction-based dataset, enabling the user or developer to use our system by providing instructions along with code. Therefore, we created four types of robust instructions. We created four carefully curated prompts to instruct GPT-4 to create 20 instructions for each task; therefore, we have 80 instructions. Moreover, we provided two sample examples with each prompt to guide GPT-4 toward generating the most appropriate instructions. These instructions are manually appended to the input code during training and testing, based on the desired task. Table 14 shows the prompts we used to generate the 20 instructions for each task. The instructions generated by these prompts are available in our repository. We provide more details on our data preparation in Section DeBinVul Dataset Preparation in the Appendix.

Step 4: Fine Tuning Process
Tokenization of Decompiled Code. We use a byte-pair encoding (BPE) tokenizer, common in natural language and programming language processing, to efficiently manage large vocabularies by merging frequent byte or character pairs into tokens. This approach reduces vocabulary size while preserving common patterns, balancing granularity and efficiency for handling diverse language data. From each function f, we extract a set of tokens T, trimming the input size to 512 tokens. We also add the special tokens <BOS> and <EOS> at the start and end of the program, respectively, and pad sequences shorter than 32,000 tokens with <PAD>. The tokenized decompiled code is then used as input for the model.
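A minimal sketch of this tokenization step is shown below, assuming a Hugging Face BPE tokenizer; the CodeLlama checkpoint name, the pad-token fallback, and the exact special-token handling are illustrative assumptions rather than the paper's released setup.

```python
from transformers import AutoTokenizer

# Illustrative checkpoint; any of the evaluated code LLMs ships a BPE tokenizer.
tokenizer = AutoTokenizer.from_pretrained("codellama/CodeLlama-7b-hf")
if tokenizer.pad_token is None:          # some code LLMs define no <PAD> by default
    tokenizer.pad_token = tokenizer.eos_token

MAX_LEN = 512  # each decompiled function is trimmed to 512 tokens

def encode_function(decompiled_code: str):
    """Tokenize one decompiled function with BOS/EOS added and padding applied."""
    enc = tokenizer(
        decompiled_code,
        truncation=True,
        max_length=MAX_LEN,
        padding="max_length",
        add_special_tokens=True,   # inserts the begin/end-of-sequence markers
        return_tensors="pt",
    )
    return enc["input_ids"], enc["attention_mask"]
```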
Model Training and Optimization. In this work, we explore the application of generative language models to four tasks: i) Vulnerability identification, ii) Vulnerability classification, iii) Function name prediction, and iv) Description generation. Although the first two tasks are typically classification tasks (binary and multiclass, respectively), we convert all four tasks into generative ones by leveraging our model's instruction-following capability. Specifically, the model outputs "Yes/No" for vulnerability identification, generates a "CWE-XXX" code for classification, predicts a single token for the function name, and produces multiple tokens for description generation, enabling a unified multitask approach.
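A minimal sketch of how a single decompiled function can be wrapped into a generative training example for any of the four tasks follows; the prompt template, field names, and sample values are hypothetical and not DeBinVul's exact schema.

```python
# Hypothetical prompt template; DeBinVul's exact formatting may differ.
TASK_TARGETS = {
    "identification": "Yes",          # or "No" for benign functions
    "classification": "CWE-121",      # a CWE-XXX label
    "function_name": "copy_buffer",   # a single-token name
    "description": "Copies bytes into a fixed-size stack buffer ...",
}

def build_sample(instruction: str, decompiled_code: str, answer: str) -> dict:
    """Pack one (instruction, code, answer) triple as a generative training example."""
    prompt = f"{instruction}\n\n### Decompiled code:\n{decompiled_code}\n\n### Answer:\n"
    return {"input": prompt, "output": answer}

sample = build_sample(
    "As a specialist in code security, assess this decompiled code. "
    "Is there any vulnerability (Yes/No)?",
    "bool 0x818042DOE(long param_1, int param_2) { ... }",
    TASK_TARGETS["identification"],
)
```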
Evaluation
In this section, we evaluate the effectiveness of our proposed dataset DeBinVul by benchmarking it on state-of-the-art LLMs and comparing their performance on the test set before and after fine-tuning. We evaluate our proposed system to answer the following Research Questions (RQs):
RQ1: Using our instruction-based dataset DeBinVul, how effectively can it be used to identify and classify binaries using different LLMs?
RQ2: How do the models trained with our dataset perform in function name prediction and description generation?
RQ3: Are the current LLMs generalized enough to analyze vulnerabilities in different architectures and optimization levels beyond those present in their training dataset?
Evaluation Metrics
Our evaluation uses various task-specific metrics. For example, we use accuracy, precision, recall, and F1 scores for the vulnerability identification and detection tasks in decompiled code. Acc.V refers to accuracy when all the input functions are vulnerable, and Acc.B refers to accuracy when all the input functions to the model are benign or non-vulnerable. We rely on BLEU (B.), Rouge-L (R.L), BERTScore (B.Score), and Semantic Similarity (Sim.) for the function name prediction and description generation tasks. We provide more details on the evaluation metrics in Section Evaluation Metrics in the Appendix.
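The split accuracies can be computed directly from the model's Yes/No answers; the sketch below assumes predictions and labels are simple string lists (the variable names are illustrative).

```python
def accuracy(preds, labels):
    """Fraction of predictions that match the ground-truth labels."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def split_accuracies(preds, labels):
    """Acc.V: accuracy on vulnerable inputs only; Acc.B: accuracy on benign inputs only."""
    vuln = [(p, y) for p, y in zip(preds, labels) if y == "Yes"]
    benign = [(p, y) for p, y in zip(preds, labels) if y == "No"]
    acc_v = accuracy(*zip(*vuln)) if vuln else float("nan")
    acc_b = accuracy(*zip(*benign)) if benign else float("nan")
    return acc_v, acc_b

# Example: a model that answers "Yes" for everything scores Acc.V = 1.0 but Acc.B = 0.0.
preds  = ["Yes", "Yes", "Yes", "Yes"]
labels = ["Yes", "No",  "Yes", "No"]
print(split_accuracies(preds, labels))  # (1.0, 0.0)
```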
Experimental Analysis
Experimental Setup. For our evaluations, we split our DeBinVul dataset into 80% training, 10% validation, and 10% testing. The training data included source code from the NVD dataset up to December 2021 to ensure that test data always followed the training data chronologically. We trained all benchmark models on an NVIDIA DGX server with an AMD EPYC 7742 64-core processor, 1 TB of RAM, and 8 NVIDIA A100 GPUs. Each model was trained for four epochs with a maximum token length of 512, a learning rate of 2e-5, and a batch size of 4 for our 7B-parameter models. A beam size of 1 and a temperature value of 1.0 were used for the generation tasks.
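A hedged sketch of this training configuration using the Hugging Face Trainer is given below; the checkpoint name, mixed-precision flag, and the tokenized dataset objects are placeholders, not the authors' released code.

```python
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          Trainer, TrainingArguments)

def finetune_on_debinvul(model_name: str, train_ds, eval_ds, out_dir: str = "debinvul-ft"):
    """Fine-tune one causal LLM on tokenized DeBinVul splits with the reported settings."""
    model = AutoModelForCausalLM.from_pretrained(model_name)
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    args = TrainingArguments(
        output_dir=out_dir,
        num_train_epochs=4,              # four epochs
        learning_rate=2e-5,              # learning rate of 2e-5
        per_device_train_batch_size=4,   # batch size of 4
        bf16=True,                       # assumption: bf16 on the A100 GPUs
    )
    Trainer(model=model, args=args, train_dataset=train_ds,
            eval_dataset=eval_ds).train()

    # Generation settings used at test time: beam size 1, temperature 1.0.
    prompt = tokenizer("<instruction + decompiled code>", return_tensors="pt")
    return model.generate(**prompt, num_beams=1, temperature=1.0, max_new_tokens=128)
```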
Answering Research Question 1
In answering RQ1, we investigate the effectiveness of the proposed dataset in analyzing binary code for four tasks, namely i) Vulnerability Identification, ii) Vulnerability Classification, iii) Function Name Prediction, and iv) Description of Code Objective. Throughout all our RQs, we use Accuracy, Precision, Recall, and F1 scores for vulnerability identification and classification, and BLEU, Rouge-L, BERTScore, and Cosine Similarity for function name prediction and description generation. To answer RQ1, we only used the O0 optimization level on the x86 architecture. Table 2 shows the baseline comparison on the identification task on binary code. In the Training column, the value Base implies results before training the model, and Our DS denotes results after the LLM was fine-tuned with our dataset. Overall, all the LLMs, when trained with our proposed dataset, show an improvement in F1 score of 18% or higher. Without training, CodeGen2 and StarCoder perform best, identifying vulnerabilities with 59% accuracy; however, since this is a binary task, that is very close to a randomized guess of approximately 50%. Moreover, if we look at the individual accuracy on only vulnerable and only non-vulnerable (benign) code, we can see that some models, such as CodeGen2 (Nijkamp et al. 2023), StarCoder (Li et al. 2023b), and CodeLLaMa (Roziere et al. 2023), have significantly lower accuracy (StarCoder: 70% lower) in identifying the non-vulnerable or benign functions while maintaining a higher accuracy in identifying the vulnerable functions. This indicates that these models prefer to label most functions as vulnerable, hence the identification imbalance. However, after all the models were individually trained on our proposed dataset, we see an overall increase in accuracy and F1 score, and CodeGen2 and LLaMa 3 top this task with an accuracy of 91%, almost a 30% improvement over the baseline models. Furthermore, when we look at the accuracy on only vulnerable and only benign functions, we see that in both cases the performance remains high: CodeGen2 is 94% successful at finding the vulnerable functions, and LLaMa 3 is 87% successful at finding the non-vulnerable or benign functions. For classification, in Table 3, we show the F1 score comparison. We can see that all the base models have a classification F1 score of less than 5%, and interestingly, although CodeGen2 is a code-based LLM, it shows a score of 0 (zero) for vulnerability classification. Table 15 compares all the CWEs across the different models in more depth. We provide more details on the classification task in Subsection Further Discussion on RQ1 in the Appendix.

Table 2: RQ1: Vulnerability identification task comparison between state-of-the-art LLMs vs. those trained on our dataset, DeBinVul, referred to as DBVul in the table.

Model      Training  Acc   Pre.  Rec.  F1    Acc.V  Acc.B
CodeLLaMa  -         0.56  0.60  0.78  0.68  0.78   0.23
CodeLLaMa  DBVul     0.85  0.89  0.86  0.87  0.86   0.84
CodeGen2   -         0.59  0.65  0.83  0.73  0.83   0.13
CodeGen2   DBVul     0.91  0.93  0.94  0.94  0.94   0.86
Mistral    -         0.48  0.71  0.42  0.53  0.42   0.61
Mistral    DBVul     0.89  0.95  0.88  0.91  0.88   0.90
StarCoder  -         0.59  0.60  0.97  0.74  0.97   0.01
StarCoder  DBVul     0.89  0.91  0.93  0.92  0.93   0.80
LLaMa 3    -         0.57  0.70  0.68  0.69  0.68   0.34
LLaMa 3    DBVul     0.91  0.94  0.93  0.93  0.93   0.87

Table 3: RQ1: Vulnerability classification task comparison between base LLMs vs. those trained on our dataset, DeBinVul, referred to as DBVul in the table.

        C.LLaMa  CodeGen2  Mistral  LLaMa 3  St.Coder
Base    0.04     0         0.04     0.02     0.03
DBVul   0.81↑    0.85↑     0.83↑    0.90↑    0.84↑
Table 4: RQ1: Performance of LLMs on Function Name Prediction and Description Generation tasks.

Task: Function Name Prediction
Model      Train   B.    R.L.  B.Score Prec.  B.Score F1  Sim.
Mistral    Base    0.03  0.18  0.80           0.81        0.48
Mistral    Our DS  0.11  0.28  0.88           0.88        0.77
CodeGen2   Base    0.00  0.02  0.76           0.77        0.18
CodeGen2   Our DS  0.09  0.25  0.83           0.85        0.71
StarCoder  Base    0.00  0.02  0.76           0.77        0.18
StarCoder  Our DS  0.13  0.30  0.89           0.88        0.78
LLaMa 3    Base    0.02  0.18  0.83           0.83        0.50

Answering Research Question 2
Our aim in answering RQ2 is to analyze one of the top-performing models to understand its performance across different architectures. Hence, we selected CodeLLaMa for this task to analyze the vulnerability of decompiled code. Here, we again train the base model on the same four tasks we performed in RQ1. However, RQ2 differs from RQ1 in using a multi-architecture compilation of source code into decompiled code. For identification, in Figure 2, we see that the performance is close to approximately 90% when we test by combining all the architectures. However, we see an improvement in Precision, F1, and accuracy on non-vulnerable or benign functions for MIPS. Moreover, we see a performance drop of 2-3% across all metrics for the x64 architecture, whereas the performance on x86, ARM, and MIPS remains relatively similar. Similarly, we see mixed results in the F1 score for multiclass classification of CWEs in Figure 3. For example, on CWE-121, CWE-122, CWE-427, CWE-665, CWE-758, and CWE-789, MIPS performs the highest. However, for CWE-401, CWE-690, and CWE-761, we see a relatively stable performance across all architectures. An interesting observation from Figure 3 is that, for CWE-666, the F1 score goes down to zero, which implies a limitation of our dataset on CWE-666. If we follow the trend line of the moving average for "All Architectures," we see that, overall, the model performs lower for CWE-126, CWE-617, CWE-666, and CWE-78 while maintaining good performance on the other CWEs.

Figure 2: Performance on Identification on Different Architectures.
Figure 3: Classification Performance on Different Architectures.

For the task of function name prediction and description generation in Table 5, the Cosine Similarity score shows a lower performance of 78% for x64, while MIPS and the combination of all architectures show a 4% improvement at 82% on this task. For the description generation of decompiled code, we see a more stable score, where ARM, MIPS, and x64 top at a 76% similarity score, whereas x86 shows a merely 2% lower performance of 74%.

Table 5: RQ2: Function Name Prediction and Description when source code is compiled on different architectures.

Answering Research Question 3
In answering RQ3, we evaluate generalizability by training the model on a subset of architectures and testing it on a different subset of architectures for the function name prediction and description generation tasks. In Table 6, the column "Train" depicts the set of architectures that were present during training, and the "Test" column defines the set of architectures that were present during testing. However, we kept some overlap in the architectures between training and testing for comparison. All - x64 indicates that the model was trained with the three architectures other than x64, and ARM + x86 indicates that the model was only trained on the ARM and x86 architectures. For function name prediction, in Table 6, we can see that when the model was trained without x64, there is a very slight performance drop of only 1% on the Cosine Similarity score when tested on x64. However, when the model was trained on ARM and x86, there was a 4% drop in performance for x86 compared to ARM, even though x86 was still in the training data. Furthermore, for description, when the model was trained with All - x64, the performance on x64 only dropped by 2% on the Cosine Similarity score, and when the model was trained on ARM + x86 and tested with "All," we see almost no performance change. Furthermore, we also tested generalizability across the O0 and O3 optimization levels on the function name prediction and description tasks. For both tasks, the model was trained on the O0 optimization level and tested on the O3 optimization level. We see a mere 1% improvement when the model was trained and tested on the same optimization level. From this analysis, we can safely conclude that using different architectures has little to no effect on the function name prediction and description generation tasks.
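A small sketch of how such train/test architecture splits can be assembled, assuming each DeBinVul record carries its compilation metadata (the field names are illustrative):

```python
# Hypothetical record layout: each sample carries its compilation metadata.
samples = [
    {"arch": "x86",  "opt": "O0", "code": "...", "answer": "..."},
    {"arch": "x64",  "opt": "O0", "code": "...", "answer": "..."},
    {"arch": "ARM",  "opt": "O3", "code": "...", "answer": "..."},
    {"arch": "MIPS", "opt": "O0", "code": "...", "answer": "..."},
]

def split_by_arch(data, train_archs, test_archs):
    """E.g. train on All - x64 and test on x64, or train on ARM + x86 only."""
    train = [s for s in data if s["arch"] in train_archs]
    test  = [s for s in data if s["arch"] in test_archs]
    return train, test

all_archs = {"x86", "x64", "ARM", "MIPS"}
train, test   = split_by_arch(samples, all_archs - {"x64"}, {"x64"})   # "All - x64" / "x64"
train2, test2 = split_by_arch(samples, {"ARM", "x86"}, {"x86"})        # "ARM + x86" / "x86"
```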
Table 6: RQ3: Generalizability Testing with Different Architectures and Optimization Levels on predicting function name and description generation.

Task           Train      Test  B.    R.L.  B.Score Pre.  B.Score F1  Sim
Function Name  All        ARM   0.61  0.72  0.96          0.96        0.79
Description    All - x64  x64   0.12  0.29  0.88          0.88        0.75
Description    ARM + x86  All   0.11  0.29  0.89          0.88        0.75
Description    ARM + x86  x86   0.10  0.29  0.89          0.88        0.75
Description    O0         O0    0.10  0.25  0.89          0.89        0.70
Description    O0         O3    0.08  0.25  0.87          0.87        0.69

To evaluate the generalizability of Large Language Models (LLMs) in real-world scenarios, we conducted a small-scale experiment using a generalized instruction-based dataset. Specifically, we tested Mistral and LLaMA 3 on the Stanford Alpaca dataset (Taori et al. 2023), performing inference on the base models prior to training with our dataset. Initial cosine similarity scores were 0.67 for Mistral and 0.73 for LLaMA 3. After training the models on our proposed dataset, we reassessed performance on the Stanford Alpaca dataset; the scores for Mistral and LLaMA 3 dropped to 0.56 and 0.70, respectively. The notable decrease in Mistral's performance is likely due to its smaller model size (2B parameters), which led to catastrophic forgetting when trained on new data, whereas the 7B-parameter LLaMA 3 retained much of its prior learning. Additionally, we conducted an N-day vulnerability analysis, where LLaMA 3 and Mistral identified 15 and 6 N-day vulnerabilities, respectively.

Related Work
Recent advances in binary vulnerability detection have focused on leveraging intermediate representations and deep learning techniques to address the challenges posed by code reuse. VulHawk (Luo et al. 2023) employed an intermediate representation-based approach using RoBERTa (Liu et al. 2019) to embed binary code and applied a progressive search strategy for identifying vulnerabilities in similar binaries. Asteria-Pro (Yang et al. 2023) utilized LSTM (Hochreiter and Schmidhuber 1997) for large-scale binary similarity detection, while VulANalyzeR (Li et al. 2023a) proposed an attention-based method with Graph Convolution (Kipf and Welling 2017) and Control Flow Graphs to classify vulnerabilities and identify root causes. QueryX (Han et al. 2023) took a different approach by converting binaries into static source code through symbolic analysis and decompilation to detect bugs in commercial Windows kernels. In the realm of code summarization for decompiled binaries, Al-Kaswan et al. (Al-Kaswan et al. 2023) fine-tuned the CodeT5 model (Wang et al. 2021) on decompiled function-summary pairs, while HexT5 (Xiong et al. 2023) extended CodeT5 for tasks like code summarization and variable recovery. BinSum (Jin et al. 2023) introduced a binary code summarization dataset and evaluated LLMs such as GPT-4 (OpenAI 2023), Llama-2 (Touvron et al. 2023), and Code-LlaMa (Roziere et al. 2023) across various optimization levels and architectures. Additionally, Asm2Seq (Taviss et al. 2024) focused on generating textual summaries of assembly functions for vulnerability analysis, specifically targeting x86 assembly instructions.

Conclusion
In this study, we present a comprehensive investigation of large language models (LLMs) for the classification and identification of vulnerabilities in decompiled code and source code to determine the semantic gap. The primary contribution of our work is the development of the DeBinVul dataset, an extensive instruction-based resource tailored for vulnerability identification, classification, function name prediction, and description across four architectures and two optimization levels. Our experiments demonstrate that DeBinVul significantly improves vulnerability identification and classification by up to 30% compared to baseline models on the x86 architecture. Additionally, we provide an in-depth analysis of how different LLMs perform across var-
Figure 4: A high-level overview of our investigation process: From Vulnerability Injection to Vulnerability Analysis Using Decompiled Code (Data Collection and Parsing, Code Injection, Validation and Compilation, Decompilation to Binary Code, and Investigation against LLM models).
Table 7: Number of functions and total counts in various top C/C++ general-purpose and IoT repositories where we injected 8 CWE vulnerabilities into some randomly selected functions.

Repository Name  Domain                Total
Linux Kernel     Operating System      3
Apache HTTPD     Web Server            7
OpenSSL          Security Library      2
FFmpeg           Multimedia Framework  6
cURL             Data Transfer         5
MQTT             IoT Protocol          2
Zigbee           IoT Networking        8
Node.js          Runtime Environment   3
SQLite           Database              61
Json-C           Data Format           77

We initially extracted all the functions from these ten repositories to investigate their effectiveness. Then, we randomly selected some functions into which to inject the vulnerabilities demonstrated in Table 8. After injecting the vulnerabilities and fixing the compilation errors, we compile each repository into its binaries and decompile them back to code using Ghidra (NSA 2019). As a result, the decompiled versions of the functions into which we injected vulnerabilities are also vulnerable. Our analysis includes identification, classification, and function name prediction on the decompiled code. Table 7 summarizes the code we generated to investigate vulnerability across different LLMs.

Table 8: Performance of API-based and Open LLMs for Vulnerability Identification and Classification Tasks in Decompiled Binaries.

Compiling an open-source repository is challenging because it requires many software and hardware dependencies to run appropriately and to be compiled into the binaries listed in the makefile. We explored ten popular C repositories from GitHub, mentioned in Table 7. Functions from these repositories were used to generate an adversarial attack on code. After obtaining the repositories, we extracted the function names from the source code by parsing each function definition with Tree-Sitter (Brunsfeld et al. 2018) and using an S-expression to extract the function name.

We randomly selected 200 functions from these repositories and injected vulnerabilities using GPT-4. Each function was appended with instructions on how to inject the vulnerability. However, some of the injected vulnerable code had compilation errors. Therefore, we had to remove some of them, leaving 138 samples. Furthermore, we use the original non-injected functions as non-vulnerable samples in our adversarial dataset, totaling 276 decompiled code samples. Then, we compiled the repositories with the injected vulnerable functions using the GCC compiler on a DGX A100 server with an x86_64 Linux operating system. Then, we used Ghidra (NSA 2019) to decompile the binaries into decompiled code.

Although we provided GPT-4 with strict instructions on injecting vulnerabilities without creating potential errors, GPT-4 occasionally introduced compiler errors that would prevent the build of the vulnerable repository. Some of these compiler errors included accessing fictitious fields of structures, calling functions that did not exist, and minor syntax errors. Initially, we randomly picked 200 code samples for vulnerability injection; 62 of the samples were not compilable, and we ignored them. Table 11 shows the total number of decompiled vulnerable functions per CWE category.
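A hedged sketch of the injection step described above, using the OpenAI chat API; the prompt wording, model identifier, and function signature are illustrative rather than the exact instructions used in our pipeline.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

INJECTION_PROMPT = (
    "You are a security researcher building a test corpus. "
    "Rewrite the following C function so that it contains a {cwe} vulnerability, "
    "keeping it compilable and changing as little as possible. "
    "Return only the modified function."
)

def inject_vulnerability(function_src: str, cwe: str, model: str = "gpt-4") -> str:
    """Ask GPT-4 to plant a specific CWE into one extracted C function."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": INJECTION_PROMPT.format(cwe=cwe)},
            {"role": "user", "content": function_src},
        ],
    )
    return response.choices[0].message.content
```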
Reasoning of Weak Performance in Investigation
Reasoning on Poor Performance of LLMs
Reverse engineers face many challenges when analyzing decompiled code. Understanding the objective and semantics of decompiled code is generally more complex than understanding the semantics of source code. During the decompilation process, the variables or the flow of statements sometimes get changed for optimization. As a result, working with decompiled code for tasks such as vulnerability analysis or decompiled code summarization (Al-Kaswan et al. 2023) is more challenging. Some of the primary reasons for poor performance could be directly related to the removal of comments during decompilation, the replacement of function names with the memory address of the function, complex control flow, and the obfuscation of variable names.

C1: Comments do not Exist. When source code is compiled, the compiler typically ignores comments, and they no longer exist in the compiled binary. Therefore, comments are irrecoverable in decompiled code. Without comments, comprehending decompiled code is incredibly challenging and time-consuming, as the code provides limited information about its purpose and intended behavior. Therefore, the reverse engineer has to derive meaning from syntax and structure.

C2: Ambiguous Function Calls. When source code is compiled, the compiler may optimize the code by replacing standard function calls, such as strcpy, with a custom
function that performs the same task more efficiently. Alternatively, the compiler may optimize the binary by inlining the instructions of common function calls in the callee function, effectively removing the function call altogether. This may prove challenging for reverse engineers attempting to understand code semantics through commonly called functions. This usually happens during decompilation since the disassembler does not know the original function name, and the replacement string 0x818042DOE is the function's address in the system memory.

C3: Complex Control Flow. The control flow of the source code may be optimized and modified in the compiled binary. As a result, the decompiled code may have a more complex control flow of statements, which is difficult to understand from a reverse engineer's perspective. A common example is loop unrolling (Sarkar 2000; Huang and Leng 1999), where a loop is unraveled as a sequence of instructions instead of jumping back to the beginning of the loop until a condition is met. However, these optimizations can sometimes be unusual and confusing when comprehending the flow of decompiled code.

DeBinVul Dataset Preparation
In this section, we highlight the technical details of how we extract each dataset component.
Function Extraction. To extract all function definitions from a file, we used the S-expression support of Tree-Sitter. In Tree-sitter, an S-expression (symbolic expression) is a way to represent the Abstract Syntax Tree (AST) of source code in a nested, parenthetical format. Each node in the tree is represented as a list, starting with the node type followed by its children, which can be terminal tokens or further nested lists. This format provides a clear and concise textual representation of the code's syntactic structure. To extract the functions, we used the following S-expression: (function_definition) @func-def.

After extracting the functions using this S-expression, our next task is to separate vulnerable from non-vulnerable functions. We found a specific pattern that makes this task more straightforward for us. We observe that all function definitions extracted from a file contain either "good" (a benign entry) or "bad" (a vulnerable entry) in the function's name. For each of the extracted function definitions, we used another S-expression to extract the function name from the function definition: (function_declarator (identifier) @func_name). Complete definitions of these S-expressions are available in the repository we provided earlier.

After extracting the functions and function names, our next task is to classify the functions. This part is relatively straightforward. If the function name contains the substring "good," we consider it a benign or non-vulnerable function. However, if the function name contains the substring "bad," we consider the function vulnerable. If the function is vulnerable, the function name has the format CWEXXX_rest_of_function_name. Therefore, we take the first part of the function name (CWEXXX) to capture the CWE number of the vulnerable code. Table 11 shows the total number of decompiled functions we generated for each CWE.
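The query-based extraction and good/bad labeling can be sketched as follows. This assumes the tree_sitter_languages helper for loading the C grammar and the older py-tree-sitter captures() interface that returns (node, capture-name) pairs, so treat it as an illustration of the S-expressions above rather than our exact tooling.

```python
import re
from tree_sitter_languages import get_language, get_parser  # assumed helper package

C_LANGUAGE = get_language("c")
parser = get_parser("c")

FUNC_QUERY = C_LANGUAGE.query("(function_definition) @func-def")
NAME_QUERY = C_LANGUAGE.query("(function_declarator (identifier) @func_name)")

def extract_functions(source: bytes):
    """Yield (name, text) for every function definition in a C source file."""
    tree = parser.parse(source)
    for node, _ in FUNC_QUERY.captures(tree.root_node):
        name_caps = NAME_QUERY.captures(node)
        if not name_caps:
            continue
        name_node = name_caps[0][0]
        name = source[name_node.start_byte:name_node.end_byte].decode()
        yield name, source[node.start_byte:node.end_byte].decode()

def label_function(name: str):
    """'bad' functions are vulnerable, 'good' ones benign; the CWEXXX prefix gives the class."""
    if "bad" in name:
        label = "vulnerable"
    elif "good" in name:
        label = "benign"
    else:
        label = "unknown"
    cwe = re.match(r"(CWE\d+)", name)
    return label, cwe.group(1) if cwe else None
```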
Compilation and Optimization. Our compilation process strategically employs the -O0 and -O3 optimization levels to assess the impact of compiler optimizations on the security and functionality of executables. By selecting these two extreme levels, we can thoroughly evaluate how optimizations influence executable behavior in ways that the intermediate levels, -O1 and -O2, may not fully reveal. The -O0 level, which applies no optimization, ensures that the compiled code remains a straightforward representation of the source code. This direct correspondence is critical for accurately tracing vulnerabilities back to their source, providing a clear baseline for understanding the application's intended behavior.

In contrast, the -O3 level introduces aggressive optimizations such as function inlining, loop unrolling, and advanced vectorization. These can enhance performance and efficiency and potentially introduce or expose vulnerabilities, such as buffer overflows, specifically related to these optimization techniques. Moreover, -O3 mimics the high-performance conditions often found in production environments, making it invaluable for simulating real-world application scenarios. This dual approach, employing both -O0 and -O3, allows us to capture a comprehensive range of effects, from no optimization to maximum optimization, thereby providing a broad-spectrum analysis of how different optimization levels can affect an executable's performance, size, and, crucially, its security properties. This method ensures we identify any vulnerabilities introduced or masked by compiler optimizations, offering a robust evaluation that intermediate optimization levels might overlook.

We utilized the following compiler commands: gcc -O0 (x86), gcc -O3 (x86), clang -O0 (x86), clang -O3 (x86), aarch64-linux-gnu-gcc -O0 (ARM), and aarch64-linux-gnu-gcc -O3 (ARM). The -DINCLUDEMAIN option was included to define the main function necessary for compiling the CWE code. We compiled the source code twice for each compiler command, using the -DOMITGOOD and -DOMITBAD options to generate the vulnerable and benign executables, respectively. This systematic approach ensured that we could thoroughly examine the impact of different compilers, optimization levels, and code variants on the security properties of the executables.
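A sketch of driving this compilation matrix from Python; the flag set mirrors the commands above, while the output naming and the use of subprocess are incidental choices.

```python
import subprocess
from itertools import product

COMPILERS = {
    "x86-gcc": "gcc",
    "x86-clang": "clang",
    "arm-gcc": "aarch64-linux-gnu-gcc",
}
OPT_LEVELS = ["-O0", "-O3"]
VARIANTS = ["-DOMITGOOD", "-DOMITBAD"]   # build the vulnerable and benign executables

def compile_all(src_file: str):
    for (tag, cc), opt, variant in product(COMPILERS.items(), OPT_LEVELS, VARIANTS):
        out = f"{src_file}.{tag}{opt}{variant}.bin"
        # '-s' strips symbol information (function names) from the executable.
        cmd = [cc, opt, "-DINCLUDEMAIN", variant, "-s", src_file, "-o", out]
        subprocess.run(cmd, check=False)
```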
Decompilation. When producing the binaries that we later decompile, we use the extra flag -s. The -s flag in GCC instructs the compiler to strip symbol information from the resulting executable, including function names. This significantly reduces the executable's size but also hinders debugging and reverse engineering efforts. Stripped binaries can be more challenging to analyze and understand, potentially making it more difficult for attackers to exploit vulnerabilities. However, it is essential to note that while -s removes human-readable function names, it does not obscure the underlying code logic or prevent advanced reverse engineering techniques.

Instruction Generation. We created 20 instructions for each task, totaling 80 instructions across the four tasks. We used four specially curated prompts with GPT-4 to automatically generate these 80 instructions for the four decompiled binary analysis tasks.

Post-processing. Following the extraction, compilation, and decompilation of functions using Ghidra, several post-processing steps were undertaken to ensure the dataset's quality and suitability for vulnerability analysis. First, instances of gibberish code that would not compile were identified and removed to prevent skewed results. Additionally, empty methods containing only method names without executable code were eliminated, as they provided no value to the analysis. We also encountered numerous code samples that were either identical or too similar, differing only in variable names (e.g., 'srini_string' versus 'string_srini'); these redundancies were systematically removed to maintain dataset diversity and prevent bias. Furthermore, many CWE categories lacked viable code examples due to insufficient training data in the original CWE dataset. To address this, we employed synthetic code generation and semi-supervised learning techniques to augment the dataset, thereby increasing the representation of underrepresented CWEs. These post-processing steps were crucial in refining the dataset, ensuring it was robust, diverse, and ready for accurate vulnerability analysis. Table 10 briefly overviews our proposed dataset.
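A minimal sketch of the identifier-insensitive de-duplication mentioned above; the positional renaming rule is an assumption about how "identical up to variable names" can be detected.

```python
import re

def normalize(code: str) -> str:
    """Crude canonical form: collapse whitespace and rename identifiers positionally."""
    tokens = re.findall(r"[A-Za-z_]\w*|\S", code)
    mapping, out = {}, []
    for tok in tokens:
        if re.match(r"[A-Za-z_]\w*$", tok):
            mapping.setdefault(tok, f"id{len(mapping)}")
            out.append(mapping[tok])
        else:
            out.append(tok)
    return " ".join(out)

def deduplicate(functions):
    """Drop functions identical up to renaming (e.g. srini_string vs string_srini)."""
    seen, kept = set(), []
    for fn in functions:
        key = normalize(fn)
        if key not in seen:
            seen.add(key)
            kept.append(fn)
    return kept
```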
Table 9: A brief description of the six different LLMs we tested to analyze their performance on analyzing code vulnerability.

Model      Src.        Par.  Modality   Access
GPT-4      OpenAI      1.7T  Text/Code  API
Gemini     Google      ~     Text/Code  API
CodeLLaMa  Meta AI     7B    Code       Open
CodeGen2   Salesforce  7B    Code       Open
LLaMa 3    Meta AI     8B    Text/Code  Open
Mistral    Mistral AI  7B    Text/Code  Open

Evaluation Metrics
BLEU Score. The BLEU (Papineni et al. 2002) score is a syntax-based way of evaluating machine-generated text, producing a value between 0 and 1. It is a reference-based metric and may not capture all aspects of translation quality, such as fluency or semantic accuracy.
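A brief sketch of computing a smoothed sentence-level BLEU score with NLTK; the whitespace tokenization and smoothing method are incidental choices.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def bleu(reference: str, candidate: str) -> float:
    """Sentence-level BLEU between a reference description and a generated one."""
    smooth = SmoothingFunction().method1          # avoids zero scores on short texts
    return sentence_bleu([reference.split()], candidate.split(),
                         smoothing_function=smooth)

print(bleu("copies bytes into a fixed size stack buffer",
           "copies data into a fixed size buffer on the stack"))
```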
Table 10: Binary Functions across four Computer Architectures and two different Optimization Levels. Our proposed dataset DeBinVul contains 150,872 binary functions covering over 40 CWE classes.
Table 11: CWE Count Per Class of Decompiled Binary Code.

CWE Number  Count   CWE Number  Count
CWE-127     3344    CWE-666     799
CWE-590     3330    CWE-510     699
CWE-124     3340    CWE-426     660
CWE-121     3340    CWE-467     479
CWE-401     3339    CWE-415     370
CWE-122     3339    CWE-506     340
CWE-761     3339    CWE-475     320
CWE-690     3339    CWE-319     319
CWE-126     3338    CWE-464     300
CWE-427     3338    CWE-459     260
CWE-78      3338    CWE-773     230
CWE-789     3338    CWE-476     180
CWE-606     3338    CWE-605     180
CWE-680     2797    CWE-469     160
CWE-665     1696    CWE-675     140
CWE-758     1677    CWE-404     100
CWE-123     1349    CWE-775     70
CWE-617     1087    CWE-681     70
CWE-457     980     CWE-688     40
CWE-416     830     CWE-685     30

Rouge-L. Similar to BLEU, the Rouge-L (Lin 2004) score is also a number between 0 and 1 that measures the syntax-based similarity of two generated texts. It produces a score by quantifying precision and recall over the longest common subsequence (LCS) between the generated and reference codes.

BERTScore. Furthermore, we use BERTScore (Zhang et al. 2019) for semantic comparison, using a cosine similarity score to identify how well the generated tokens match the ground-truth tokens. BERTScore generates an embedding vector for each generated token, performs a cosine similarity with all the ground-truth tokens, and averages the result to produce the final score. It is a token-based similarity measure over embedding vectors that permits a soft similarity measure instead of exact matching, since secure code can be generated in various ways.

Cosine Similarity. BLEU, Rouge-L, and BERTScore compare n-gram-based exact token matching or embedding-based token matching. However, in natural language the exact meaning can be kept intact while the description is written differently. Thus, we aim to measure the semantic similarity between the embedding of the LLM-generated text and the ground truth. Formally, given the sequence of tokens of the LLM-generated description $D = \{w_1, w_2, \ldots, w_n\}$ and the ground-truth reference description $\hat{D} = \{\hat{w}_1, \hat{w}_2, \ldots, \hat{w}_m\}$, we use a sentence encoder $E$ to generate the embeddings $E(D) = \{e_{w_1}, e_{w_2}, \ldots, e_{w_n}\}$ and $E(\hat{D}) = \{e_{\hat{w}_1}, e_{\hat{w}_2}, \ldots, e_{\hat{w}_m}\}$. The entire semantics of $D$ and $\hat{D}$ are then represented by

$$e_D = \frac{1}{|D|}\sum_{i=1}^{n} e_{w_i}, \qquad e_{\hat{D}} = \frac{1}{|\hat{D}|}\sum_{j=1}^{m} e_{\hat{w}_j} \tag{1}$$

Therefore, we calculate the similarity score as

$$\mathrm{sim}(D, \hat{D}) = \frac{e_D \cdot e_{\hat{D}}^{T}}{\lVert e_D \rVert \cdot \lVert e_{\hat{D}} \rVert} \tag{2}$$
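A small sketch of Equations (1)-(2) using a sentence-transformers model as the encoder E; the checkpoint is illustrative, and such encoders typically mean-pool token embeddings internally, which corresponds to Eq. (1).

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Illustrative sentence encoder E; the paper does not prescribe a specific checkpoint.
encoder = SentenceTransformer("all-MiniLM-L6-v2")

def description_similarity(generated: str, reference: str) -> float:
    """Cosine similarity between the pooled embeddings of D and D-hat (Eqs. 1-2)."""
    e_d, e_dhat = encoder.encode([generated, reference])
    return float(np.dot(e_d, e_dhat) / (np.linalg.norm(e_d) * np.linalg.norm(e_dhat)))

print(description_similarity(
    "Copies user input into a fixed-size stack buffer without a bounds check.",
    "The function writes attacker-controlled data into a small stack buffer.",
))
```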
Further Discussion on RQ1
For the classification task, from Table 13, we can see that CodeGen2 and StarCoder show the highest performance across most CWEs, with an F1 score over 90% for CWE-590, CWE-121, and CWE-122. Furthermore, Mistral also shows similar performance across the other CWEs, whereas CodeLLaMa shows relatively poorer performance. However, we also notice that for some CWEs, such as CWE-121, CodeLLaMa and LLaMa 3 show 100% accurate performance. Moreover, Mistral (Jiang et al. 2023) shows 100% accuracy for CWE-789 and CWE-123, which further elucidates
Table 12: The instructions we used to investigate LLMs to determine their performance on decompiled code vulnerability
identification and classification tasks.
CWE
Model Metric 401 690 124 761 590 121 789 122 606 127
Pre. 0.83 0.9 0.94 0.97 0.93 1 0.68 0.59 0.71 0.93
CodeLLaMa Rec 0.51 0.79 1 0.9 0.9 1 0.48 0.96 0.52 0.61
F1 0.63 0.84 0.97 0.93 0.91 1 0.57 0.73 0.6 0.74
Pre. 0.98 0.98 0.81 1 0.86 0.86 0.92 1 0.79 0.59
CodeGen2 Rec 1 1 0.62 1 0.97 1 1 0.91 1 0.45
F1 0.99 0.99 0.7 1 0.91 0.93 0.96 0.95 0.88 0.51
Pre. 0.84 0.89 0.94 0.97 1 0.91 0.75 0.74 0.73 1
Mistral Rec 0.57 0.97 1 1 0.83 1 0.44 0.96 0.48 0.39
F1 0.68 0.93 0.97 0.98 0.91 0.95 0.56 0.84 0.58 0.56
Pre. 0.9 0.94 0.97 0.89 1 1 0.76 0.69 0.83 0.93
LLaMa 3 Rec 0.76 1 1 1 0.9 1 0.81 1 0.83 0.61
F1 0.82 0.97 0.99 0.94 0.95 1 0.79 0.82 0.83 0.74
Pre. 0.78 0.69 0.87 0.94 1 0.97 0.79 0.93 0.77 0.92
StarCoder Rec 0.57 0.85 1 1 0.9 1 0.41 1 0.74 0.52
F1 0.66 0.76 0.93 0.97 0.95 0.98 0.54 0.96 0.76 0.67
122 121 789 680 758 416 665 457 123 617
Pre. 0.9 0.51 1 0.78 0.94 1 0.62 0.73 1 0.75
CodeLLaMa Rec 0.86 1 0.81 0.9 0.89 0.87 0.71 0.92 0.73 0.9
F1 0.88 0.68 0.89 0.84 0.91 0.93 0.67 0.81 0.84 0.82
Pre. 0.95 0.95 0.94 1 0.92 0.76 0.7 0.8 1 0.57
CodeGen2 Rec 0.86 0.9 0.83 0.88 0.71 1 0.93 0.92 0.83 0.44
F1 0.9 0.92 0.88 0.94 0.8 0.86 0.8 0.86 0.91 0.5
Pre. 0.83 0.49 1 0.83 0.94 0.82 0.68 0.79 1 0.89
Mistral Rec 0.91 1 1 0.95 0.94 0.93 0.93 0.92 1 0.8
F1 0.87 0.66 1 0.88 0.94 0.88 0.79 0.85 1 0.84
Pre. 0.91 0.84 0.95 0.95 0.95 1 0.87 0.92 1 0.82
LLaMa 3 Rec 0.91 1 1 0.9 1 0.93 0.93 0.92 1 0.9
F1 0.91 0.91 0.98 0.92 0.97 0.97 0.9 0.92 1 0.86
Pre. 0.87 0.5 0.78 0.95 0.93 1 0.86 0.73 1 1
StarCoder Rec 0.91 1 1 1 0.78 0.93 0.86 0.92 0.82 0.8
F1 0.89 0.67 0.88 0.98 0.85 0.97 0.86 0.81 0.9 0.89
Table 14: Prompts used as an input for GPT-4 to generate 20 Instructions for each task, namely, i) Identification, ii) Classifica-
tion, iii) Function Name Prediction and iv) Description
Task Prompt
Consider yourself as a Code Security Reverse engineer Specialist.
Task: You are to generate a prompt to ask a large language model to Identify Vulnerabilities in decompiled
code. Make sure that the prompt ensures the answer is in ”Yes/No” format and that no extra content is
generated by the LLM when using the prompt.
Now, create 20 different variants of these prompts.
Here are two sample examples:
Ident. 1. ”You are a Reverse Engineer of Code. Detect the presence of events of vulnerabilities that exist in the
given code.
If a vulnerability exists, answer ”YES”, otherwise, answer ”NO”. Do not produce any extra outputs.
2. ”As an event Vulnerability expert of decompiled code, analyze the code to find whether any
vulnerability exists in the following code.
If you think the code has vulnerability, produce a single word ”YES”, and if vulnerability does not exist,
reply ”NO” only. Do not output any extra words or URLs.
Consider yourself as a Code Security Reverse engineer Specialist.
Task: You are to generate a prompt to ask a large language model to Classify Vulnerabilities in decompiled
code. The classification is based on the CWE numbers. Make sure that the prompt ensures the answer is in
”CWE-XXX”
format and that no extra content is generated by the LLM when using the prompt. Here ”XXX” refers to
the vulnerability number.
Now, create 20 different variants of these prompts.
Here are two sample examples:
Class. 1. ”You are a Reverse Software Engineer of Decompiled Code. Classify the presence of vulnerabilities
that exist
in the given code.
Generate the vulnerability classification in the following format only ”CWE-XXX”. Do not produce
any extra outputs.
2. ”As an expert classifying Vulnerability in decompiled code, analyze the code to find the
vulnerability category existing in the following code. Print only the CWE number of the vulnerability
in this format ”CWE-XXX”.
Do not output any extra words or URLs.
Consider yourself as a Code Security Reverse engineer Specialist who can predict function names on
decompiled binary code.
Task: You are to generate a prompt to ask a large language model to predict function name in decompiled code.
The function name can only have the characters supported in C/C++ programming language, and the outcome
would only consist of a single word, and no extra content is generated by the LLM when using the prompt.
Now, create 20 different variants of these prompts.
Func. Name Pred.
Here are two sample examples:
1. "You are a Reverse Software Engineer of Decompiled Code. Predict the name of the decompiled code.
Generate the function name only in a single word. You can use camelCasing or Snake Casing.
Do not produce any extra outputs.
2. ”As an expert classifying Vulnerability in decompiled code, analyze the code to determine
function name in the following code.
Print only the Function name in a single word. You are allowed to use snake casing or camelCasing
to generate the function names. Do not output any extra words or URLs.
Consider yourself as a Code Security Reverse engineer Specialist who can generate the description of a
decompiled code.
Task: You are to generate a prompt to ask a large language model to Describe the objective
and/or Vulnerabilities in decompiled code. The description should explain the flow of the code but not
specify any variable or function names. The descriptions should be generic enough to be used in a
decompiled code as well.
Now, create 20 different variants of these prompts.
Here are two sample examples:
Desc.
1. "You are a Reverse Software Engineer of Decompiled Code. Describe the objective and the security
issues that exist in the given code. Make sure you generate a generic description of the vulnerability
without specifying function or variable names. Please ensure that the generated description
can be used in decompiled code.
2. ”As an expert explaining the objectives and vulnerability in decompiled code, analyze the code
to explain the objective and vulnerability. Ensure the generalizability of the description by not
mentioning the function and the variable names as the description will be used in a decompiled
code where the variable and function names are obfuscated.”
Table 15: Average F1 Score Comparison on Base vs. Trained LLMs for Classification Task.