Figure 1: Traditional Obfuscation Engine vs. LLM Obfuscation Engine (METAMORPHASM BENCHMARK). a) The target code is fetched from main memory, disassembled, and fed to the code analyzer. The configuration unit provides metadata to the code analyzer. The code analyzer provides assembly code to the obfuscator units; the result is assembled and the binary is deployed back into main memory. b) In the LLM obfuscation engine, the target code is fetched from main memory, disassembled, and fed to the LLM. The LLM generates obfuscated code and sends it to the assembler, which produces a binary and deploys it back into main memory.
(c) Code Analyzer: This unit enhances the reliability of the obfuscation engine by providing extra information and analysis to the obfuscator units. (d) Obfuscator Units: Depending on the purpose and complexity of the obfuscation engine, these components are responsible for obfuscating code based on the code analysis. For more complex obfuscation, additional units are required.

We replace the code analyzer and obfuscator units with an LLM, as illustrated in Figure 1. The use of an LLM offers several potential advantages: (a) Ease of Generation and Training: LLMs require minimal development and debugging compared to the classic method, which involves extensive development and testing. (b) Platform Independence: Unlike many classic obfuscation engines that are platform-dependent (e.g., Windows or Linux), LLMs are generally platform-independent and do not require special adjustments. (c) Cost Efficiency: The cost of building an LLM-based obfuscator is substantially lower than developing a classic obfuscation engine in traditional programming languages such as C/C++ or Java. (d) Ease of Updating and Maintenance: Updating and maintaining an LLM model is relatively straightforward.

Our contributions include the following:
• The METAMORPHASM Dataset (MAD): A dataset comprising 328,200 assembly code samples specifically crafted to test the ability of LLMs to perform code obfuscation. To our knowledge, this is the first assembly code obfuscation dataset; it provides researchers with a unique resource to perform a more detailed analysis of obfuscation strategies and to evaluate the resilience of current detection technologies.
• Baseline Code-Generative Models: We propose a series of baseline generative models, both a language model and LLMs, that are either trained, zero-shot prompted, or in-context-learned on our dataset, evaluate them with automatic scores, and conduct a human review to inspect their obfuscation abilities.
• We provide three distinct types of obfuscation, each with 109,400 samples (with an average code length from 399 to 507 for both the original code and the obfuscated/modified code): Dead Code Insertion, Register Substitution, and Control Flow Change, since the performance of LLMs can vary with the specific obfuscation technique used and the complexity of the code.

Background of Code Obfuscation Techniques

Mathematically, we can define obfuscation as follows: a code obfuscator is a function f that transforms an original source program P into an obfuscated version P′. Formally, this can be represented as f : P → P′, where P is the space of all possible programs and P′ is the space of all possible obfuscated programs. The obfuscation function f must ensure that the obfuscated program P′ behaves identically to the original program P for all inputs, so we require

∀x ∈ X : P(x) ≃ P′(x),

where X represents the set of all possible inputs x for which P(x) and P′(x) have a valid outcome without any error.
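To make the requirement concrete, the following is a minimal, illustrative Python sketch of the ∀x ∈ X : P(x) ≃ P′(x) condition. The functions P and P_prime below are toy stand-ins rather than assembly programs from MAD, and equivalence is only spot-checked on sampled inputs rather than proven.

# Illustrative only: a toy, Python-level stand-in for the requirement
# "for all x in X, P(x) ~ P'(x)". P and P_prime are hypothetical functions
# standing in for the original and obfuscated programs; equivalence is
# spot-checked on a sample of inputs rather than formally verified.
import random

def P(x: int) -> int:
    # original "program": some arbitrary computation
    return (x + 28) & 1

def P_prime(x: int) -> int:
    # "obfuscated" version: extra dead work, same observable result
    junk = x * 0                     # dead code: contributes nothing
    return ((x + 28) & 1) + junk

def behaves_identically(p, p_obf, samples=10_000) -> bool:
    """Empirically check p(x) == p_obf(x) on randomly sampled valid inputs."""
    for _ in range(samples):
        x = random.randint(-2**31, 2**31 - 1)
        if p(x) != p_obf(x):
            return False
    return True

if __name__ == "__main__":
    assert behaves_identically(P, P_prime)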
Listing 1: Original code
83C01C  ADD EAX, 28
8BE5    MOV ESP, EBP
83E001  AND EAX, 1
0F94C1  SETE CL
42      INC EDX
83EF01  SUB EDI, 1
56      PUSH ESI
3BF9    CMP EDI, ECX
57      PUSH EDI

Dead Code Insertion: In dead code insertion, malware inserts sections of code that are irrelevant to the program's normal operation. This technique can take various forms, such as adding redundant instructions, unused variables, or unreachable code branches. These additional code segments alter the structure of the malware without affecting its functionality, causing it to appear different each time it is executed. Consequently, traditional signature-based antivirus detection methods become less effective against metamorphic malware employing this technique. Listing 1 shows a snippet of assembly code, which we call the "original code", and Listing 2 shows the original code after inserting dead code such as "NOP" or "MOV EDI, EDI". Inserting dead code, or garbage code, is one of the most important obfuscation techniques (Na, Choi, and Lee 2023).

Listing 2: Original code after inserting dead code
83C01C  ADD EAX, 28
90      NOP          ; Dead code
8BE5    MOV ESP, EBP
83E001  AND EAX, 1
8BFF    MOV EDI, EDI ; Dead code
0F94C1  SETE CL
42      INC EDX
83EF01  SUB EDI, 1
90      NOP          ; Dead code
56      PUSH ESI
3BF9    CMP EDI, ECX
57      PUSH EDI
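The transformation in Listing 2 can be automated with a small script. Below is a minimal, illustrative Python sketch of dead code insertion; the two-entry list of neutral instructions and the insertion count are assumptions for the example and do not reproduce the full dictionary of roughly 40 instructions used later in the dataset construction.

# A minimal sketch of dead code insertion over a textual assembly snippet.
# The NEUTRAL_INSTRUCTIONS list and the insertion count are illustrative
# assumptions, not the dataset's actual instruction dictionary.
import random

NEUTRAL_INSTRUCTIONS = [
    "90      NOP          ; Dead code",
    "8BFF    MOV EDI, EDI ; Dead code",
]

def insert_dead_code(asm_lines, n_insertions=4, seed=0):
    """Return a copy of the snippet with dead instructions at random positions."""
    rng = random.Random(seed)
    obfuscated = list(asm_lines)
    for _ in range(n_insertions):
        pos = rng.randint(0, len(obfuscated))       # insertion point, end allowed
        obfuscated.insert(pos, rng.choice(NEUTRAL_INSTRUCTIONS))
    return obfuscated

original = [
    "83C01C  ADD EAX, 28",
    "8BE5    MOV ESP, EBP",
    "83E001  AND EAX, 1",
]
print("\n".join(insert_dead_code(original)))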
Register Substitution: In this technique, the malware replaces register names used within its instructions with alternative register names. For example, if the original malware code uses the "EAX" register for a specific computation, the register substitution technique might replace instances of "EAX" with "EBX" or another available register. Although the functionality of the code remains the same, altering the register names changes the code structure from its original form. Listing 3 illustrates the register substitution technique (Balakrishnan and Schulze 2005).

Listing 3: Original code after register substitution
83C31C  ADD EBX, 28  ; Swap EAX by EBX
8BE5    MOV ESP, EBP
83E301  AND EBX, 1   ; Swap EAX by EBX
0F94C1  SETE CL
42      INC EDX
83EF01  SUB EDI, 1
56      PUSH ESI
3BF9    CMP EDI, ECX
57      PUSH EDI
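A hypothetical helper for this transformation is sketched below in Python. The single EAX/EBX swap and the whole-word regular expression are simplifying assumptions; a real substitution pass must also confirm the target register is free and re-encode the affected opcode bytes (note how 83C01C becomes 83C31C in Listing 3).

# A minimal sketch of register substitution: swap every textual occurrence of
# one register for another across a snippet. Opcode bytes are omitted from the
# example lines because the encodings change when registers change.
import re

def swap_registers(asm_lines, reg_a="EAX", reg_b="EBX"):
    """Swap reg_a and reg_b in the instruction text of each line."""
    pattern = re.compile(rf"\b({reg_a}|{reg_b})\b")
    swap = {reg_a: reg_b, reg_b: reg_a}
    return [pattern.sub(lambda m: swap[m.group(1)], line) for line in asm_lines]

original = [
    "ADD EAX, 28",
    "MOV ESP, EBP",
    "AND EAX, 1",
]
print("\n".join(swap_registers(original)))   # EAX becomes EBX and vice versa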
Control Flow Change: In this technique, the malware rearranges the order of instructions in its code while preserving its original functionality. This rearrangement alters the sequence of instructions without changing the overall behavior of the malware. The purpose of instruction permutation is to disrupt the linear flow of the code and introduce variability in the instruction sequence (see Listing 4, in comparison with Listing 1). By constantly shuffling the order of the instructions, the malware presents a different code structure each time it is executed, making it challenging for antivirus programs to detect and analyze (Linn and Debray 2003).

Listing 4: Original code after control flow change
EB      JMP sec1
sec3:
83EF01  SUB EDI, 1
56      PUSH ESI
3BF9    CMP EDI, ECX
57      PUSH EDI
EB      JMP sec4
sec2:
83E001  AND EAX, 1
0F94C1  SETE CL
42      INC EDX
EB      JMP sec3
sec1:
83C01C  ADD EAX, 28
8BE5    MOV ESP, EBP
EB      JMP sec2
sec4:
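The block-level rewiring shown in Listing 4 can be expressed as a short routine. The following Python sketch is illustrative only: the block size and label naming are assumptions, and opcode bytes are left out because relative jump encodings would have to be recomputed.

# A minimal sketch of control flow change: split a snippet into blocks, lay the
# blocks out in shuffled order, and chain them with labels and JMPs so the
# logical execution order is unchanged (cf. Listing 4).
import random

def control_flow_change(asm_lines, block_size=3, seed=0):
    """Shuffle the physical layout of fixed-size blocks, preserving execution order."""
    rng = random.Random(seed)
    blocks = [asm_lines[i:i + block_size] for i in range(0, len(asm_lines), block_size)]
    layout = list(range(len(blocks)))
    rng.shuffle(layout)                          # physical order of blocks in the output
    out = ["        JMP sec1"]                   # execution still starts at logical block 1
    for idx in layout:
        out.append(f"sec{idx + 1}:")
        out.extend(f"        {ins}" for ins in blocks[idx])
        out.append(f"        JMP sec{idx + 2}")  # jump to this block's logical successor
    out.append(f"sec{len(blocks) + 1}:")         # terminal label
    return out

original = ["ADD EAX, 28", "MOV ESP, EBP", "AND EAX, 1",
            "SETE CL", "INC EDX", "SUB EDI, 1"]
print("\n".join(control_flow_change(original)))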
METAMORPHASM DATASET (MAD)

The MAD, consisting of generated assembly code snippets, is generated through a four-step process:

Step 1. The source of assembly codes. The source code comes from extracting and disassembling a large number of Dynamic Link Library (DLL) and Portable Executable files that Microsoft provides for Windows users, specifically Windows 7 and Windows 8.1. We used Windows DLL and executable files because most metamorphic-malware victims, past and present, are Windows users, and malware uses many standard .NET libraries for reshaping code. We also use many open-source x64-based real assembly files or static libraries to generate assembly files.

Step 2. Code extraction and pre-processing. We use the command prompt and open-source software, such as decompilation and disassembly tools, to generate assembly code from the original files. Most of the assembly code has a large data section, which does not contain code useful for training LLMs. During the pre-processing step, we remove these sections and use only the code sections to generate datasets. Another consideration during pre-processing is removing and purging near- and far-range JMP or CALL instructions from the original code, because these parameters are tied to the local machine and temporary files, which causes a loss of generality during the training process of an LLM. After cleaning up these large quantities of opcode, and after human evaluation and verification of the large corpus, which took almost two months of full-time labor, we break the corpus down into small assembly snippets, each typically comprising twenty instructions. These snippets were then stored.
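One possible shape for this pre-processing pass is sketched below. The section markers, the address-bearing JMP/CALL filter, and the exact disassembly line format are assumptions about one plausible tool output; only the goal (keep code sections, drop machine-specific branches, chunk into roughly twenty-instruction snippets) follows the text above.

# A minimal sketch of Step 2-style pre-processing: keep only code-section lines,
# drop JMP/CALL instructions that reference machine-specific addresses, and chunk
# the remainder into fixed-size snippets. Section markers and the regex are
# assumptions about one possible disassembler output format.
import re

ADDRESSED_BRANCH = re.compile(r"\b(JMP|CALL)\b.*\b0x[0-9A-Fa-f]+\b")

def preprocess(disassembly_lines, snippet_len=20):
    code, in_code_section = [], False
    for line in disassembly_lines:
        if line.startswith(".text"):          # assumed marker for a code section
            in_code_section = True
            continue
        if line.startswith(".data"):          # assumed marker for a data section
            in_code_section = False
            continue
        if in_code_section and line.strip() and not ADDRESSED_BRANCH.search(line):
            code.append(line.strip())
    # break the cleaned corpus into snippets of roughly twenty instructions
    return [code[i:i + snippet_len] for i in range(0, len(code), snippet_len)]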
Step 3. Obfuscating assembly codes. After generating the clean assembly code snippets, the next step is to obfuscate each snippet using specific Python scripts. These scripts are designed to create three separate databases, each corresponding to a different obfuscation technique: dead code insertion, register substitution, and code flow alteration. To insert dead code, we use a dictionary of nearly 40 assembly instructions that do not affect the code's functionality but alter its structure. The script randomly inserts these "neutral assembly instructions" and saves the output in a Python dictionary as key-value pairs. For the code flow alteration dataset, the script reads each entry from the original database (created in Step 2) and randomly rearranges parts of the code to obfuscate the original assembly snippet, then saves the result in the code flow change database. The final register substitution database involves reading the original database and randomly renaming specific registers (e.g., EAX, EBX, or ECX) by swapping them with other unused registers, with the results saved in the register dataset. We then merge these three databases into one unified MAD dataset.

Step 4. Final validation and verification. The original code and the obfuscated code generated by the three techniques are manually evaluated by human experts who have more than twenty years of experience in assembly and machine code development, in order to find and remove any type of bug or defect, such as unwanted characters or wrong syntax. In the end, we package our datasets in Excel sheet format, ready to train models like CodeT5 and to examine other powerful code-generating LLMs.

Dataset Metrics

The MAD focuses on three major obfuscation techniques: dead code insertion, register substitution, and control flow change. The MAD includes 109,400 entries for each obfuscation technique, structured as (original code, obfuscated code) pairs. The first item in each pair represents the original, unobfuscated code, while the second item contains the assembly code modified using one of the obfuscation techniques. Since MAD is designed for experiments with LLMs rather than real-world applications, each text entry (representing the original code) contains only twenty lines of assembly code. This code size helps ensure minimal risk of future misconduct or misuse of the dataset.

For each dead code sample, we embed 4 to 5 dead code instructions into the original code and save it as the corresponding obfuscated code, resulting in obfuscated dead code that contains 24 to 25 lines of assembly code. For the register substitution samples, we apply register swapping to the original code and record it as the corresponding obfuscated code. Generally, each entry in the register substitution set includes at least one register swap, and the size of the swapped code remains the same as the original code. For control flow changes, we include 3 to 4 JMP instructions and their related labels, with the control flow of the program randomly altered for each entry. In all of these obfuscated codes, the core functionality of the original code remains unchanged, but the structure of the code differs from the original.
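Because each set is stored as (original code, obfuscated code) pairs in Excel sheets, loading it for prompting or fine-tuning takes a few lines of pandas. The file name and column names below are assumptions for illustration, not the released schema.

# A minimal sketch of loading one MAD obfuscation set. The file name and the
# column names ("original", "obfuscated") are assumed for illustration.
import pandas as pd

def load_mad_pairs(xlsx_path="mad_dead_code.xlsx"):
    df = pd.read_excel(xlsx_path)        # one row per (original, obfuscated) pair
    return list(zip(df["original"], df["obfuscated"]))

if __name__ == "__main__":
    pairs = load_mad_pairs()
    original, obfuscated = pairs[0]
    print(original)
    print("---")
    print(obfuscated)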
Models and Evaluation

Models: We benchmarked METAMORPHASM using a diverse array of large language models (LLMs), which we categorized into open-source (o) and proprietary (p) types, further divided into mixture-of-experts (MoE) and non-mixture-of-experts (n-MoE) models. Access to proprietary LLMs was facilitated through APIs, which are computationally demanding and financially expensive. Therefore, we selected 15,000 assembly code samples (5,000 per obfuscation mechanism) from our extensive repository of 300,000 examples to ensure a fair and reasonable comparison. For our proprietary LLM evaluations, we included GPT-4o-mini (p) (Achiam et al. 2023), an MoE model adept at handling complex assembly code patterns (OpenAI 2024). Competing against GPT-4o-mini, we employed the open-source DeepSeekCoder-v2 (o; 236B; MoE) (Zhu et al. 2024) and Codestral (MistralAI 2024) (o; 22B; n-MoE), both of which have undergone extensive training on assembly code and are proficient in generating it. Considering their established effectiveness in advanced coding tasks, CodeLLAMA (o; 34B) (Roziere et al. 2023) and LLAMA 3.1 (MetaAI 2024) (p; 405B; n-MoE) were also included in our benchmarking process. Lastly, CodeGemma (o; 7B) (Team 2024) and the trainable CodeT5 (o; 1.2B) (Wang et al. 2021) were considered suitable models for conducting future white-box studies of obfuscation using LLMs. Except for CodeT5, all the LLMs were examined using zero-shot prompting (refer to Table 1 for the zero-shot control flow change prompting template; the prompt templates for dead code insertion and register substitution can be found in the supplementary code materials) and in-context learning, with test setups including 1, 3, 5, 10, and 15 shots (see Table 2 for the few-shot prompting template).

Zero-Shot Control Flow Change Prompt
Prompt: Assembly Control Flow Change in obfuscation is a technique where the order of instructions is rearranged without altering the program's overall functionality. The goal is to make the code harder to understand and reverse-engineer. Control Flow Change leverages the fact that some instructions can be reordered safely if they are independent, meaning they do not depend on each other's results. Given the following original assembly code, determine which instructions can be safely reordered. Rearrange the identified independent instructions to achieve obfuscation. Just print the output code.
Original Assembly Code:
PUSH EDI
POP EDI
MOV EAX ...
Response:
JMP sec1
sec4:
PUSH EAX ...

Table 1: Zero-Shot Control Flow Change Prompt Structure.

Few-Shot Prompt Structure
Prompt: Zero-Shot Dead Code / Register Substitution / Control Flow Change prompt + For Example:
Original Code:
PUSH EDI
MOV EDI, DWORD PTR SS:[EBP+4]
PUSH 4 ...
(Augment k more examples for a k-shot prompt.)
Original Assembly Code:
PUSH EDI
POP EDI
MOV EAX ...
Response:
MOVZX EAX, AL
NEG EAX
ADD EAX, 0 ...

Table 2: Few-Shot Prompt Structure.
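A minimal sketch of how the few-shot structure in Table 2 can be assembled programmatically is shown below; the abbreviated instruction string and the toy example/query snippets are placeholders taken from Table 2, with the full zero-shot instructions given in Table 1 and the supplementary materials.

# A minimal sketch of assembling the few-shot prompt structure of Table 2 from
# MAD pairs. The instruction text here is abbreviated and the snippets are the
# placeholder fragments shown in Table 2.
def build_few_shot_prompt(instruction: str, examples, query: str) -> str:
    """Assemble: instruction + k worked (original, obfuscated) pairs + the query snippet."""
    parts = [instruction, "For Example:"]
    for original, obfuscated in examples:
        parts += ["Original Code:", original, "Response:", obfuscated]
    parts += ["Original Assembly Code:", query, "Response:"]
    return "\n".join(parts)

prompt = build_few_shot_prompt(
    instruction="Obfuscate the following assembly code. Just print the output code.",
    examples=[("PUSH EDI\nMOV EDI, DWORD PTR SS:[EBP+4]\nPUSH 4 ...",
               "MOVZX EAX, AL\nNEG EAX\nADD EAX, 0 ...")],
    query="PUSH EDI\nPOP EDI\nMOV EAX ...",
)
print(prompt)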
Evaluation

To measure the obfuscation level, we consider two metrics. First, we compute the character-wise Delta Entropy (∆), a measure derived from Shannon entropy. Analysts commonly use this measure as a first-pass analysis of original and obfuscated code. In the context of code obfuscation, it quantifies the complexity and diversity of the code; in effect, it gives us a criterion for how much the code mutates from original to obfuscated. For a given pair of assembly codes (original and obfuscated), we convert the original and generated snippets into sequences of symbols and apply entropy to these sequences. Then, we subtract the entropy of the original code from that of the generated code to measure the amount of obfuscation. It is defined as:

∆H_AB = (1/N) Σ_{x=1}^{N} |H(A) − H(B)|

where H(A) and H(B) stand for the entropy of the obfuscated code and the original code, respectively, and N is the number of evaluated pairs. Second, we calculate the cosine similarity (CS) using its standard formulation to assess the similarity between the original and obfuscated code.
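For one (original, obfuscated) pair, both metrics reduce to a few lines of Python, sketched below. Treating each character as a symbol and using character-frequency vectors for the cosine similarity are assumptions about one reasonable featurization; the exact implementation used in the paper may differ.

# A minimal sketch of the two evaluation metrics for a single pair of snippets:
# character-wise Shannon entropy difference and cosine similarity over
# character-frequency vectors (an assumed featurization).
import math
from collections import Counter

def shannon_entropy(text: str) -> float:
    counts = Counter(text)
    total = len(text)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def delta_entropy(original: str, obfuscated: str) -> float:
    """|H(A) - H(B)| for one pair; average this over the N pairs for the delta entropy score."""
    return abs(shannon_entropy(obfuscated) - shannon_entropy(original))

def cosine_similarity(original: str, obfuscated: str) -> float:
    a, b = Counter(original), Counter(obfuscated)
    dot = sum(a[ch] * b[ch] for ch in set(a) | set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

code_a = "ADD EAX, 28\nMOV ESP, EBP"
code_b = "ADD EAX, 28\nNOP\nMOV ESP, EBP"
print(delta_entropy(code_a, code_b), cosine_similarity(code_a, code_b))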
Functional correctness is an alternative metric, but it is impossible to apply at this scale. Prior works that use SMT solvers to validate compiler transforms are NP-complete (Yang et al. 2024) and only guarantee soundness (i.e., no false equivalents) but not completeness (i.e., no missed equivalences), and the tools used in practice expect some lifting from ASM to a higher representation (Montolio 2023). For this reason, we use Delta Entropy, which was proposed and evaluated by Yang et al. (2024), who found it an effective means of large-scale evaluation: too dramatic a change in score is a strong indicator that the code has been changed too significantly, by either wholesale rewriting (a large change) or no edits at all (i.e., no change in score). There is a middle ground of plausibly valid transformations, and our use of manual expert evaluation over two months remediates this final uncertainty.

Results and Discussion

Interpretation: The MAD includes both original and obfuscated code, with an expected delta entropy range of around 10%–20%. This range is crucial for defining an effective obfuscation engine: a delta entropy exceeding this range risks altering the code's functionality, while a value below 10% indicates minimal obfuscation. The range was defined after three human experts examined the code obfuscation produced by eight LLMs and picked GPT-4o-mini as the closest to human-performed code obfuscation (see Table 6). Additionally, maintaining a cosine similarity above 0.9 is essential, as it confirms the preservation of functional similarity between the original and obfuscated code, thereby serving as a measure of the obfuscation's success in maintaining the code's integrity without compromising its functionality. The threshold for cosine similarity was set following a human evaluation in which experts reviewed the top three LLMs across the three obfuscation techniques. We calculated the cosine similarity between the original and obfuscated code produced by the top-3 LLMs, achieving an average of 0.9.

Discussion: In a comparative analysis of general-purpose LLMs, LLAMA 3.1 notably underperforms, especially on control flow change, where it achieves only a 4.27% entropy rate in single-shot scenarios, highlighting its inadequate code mutation capabilities. In more complex tasks requiring 10 to 15 shots, LLAMA 3.1 fails to generate any valid assembly instructions and demonstrates considerable variability in cosine similarity, deviating from the expected range of 0.90 to 0.97. In contrast, GPT-4o-mini demonstrates robust performance across both the entropy and cosine similarity metrics, excelling particularly in control flow change obfuscation with high entropy due to the insertion of numerous JMP and section commands. Following closely, dead code insertion shows commendable results, and register substitution ranks third, reflecting the lower entropy typically associated with changing one or two register names. Although GPT-3.5 outperforms LLAMA 3.1, it slightly trails GPT-4o-mini, while maintaining a cosine similarity within the desired range of 0.90 to 0.97.
Codestral stands out among specialized large language models for its effective performance in the dead code and control flow change tasks, with cosine similarity values ranging from 0.90 to 0.98 (see Tables 3 and 5). However, it struggles with register substitution, indicating difficulties in modifying register names effectively (Table 4). DeepSeekCoder follows with higher performance, reflected in its elevated cosine similarity scores. These scores suggest that it accurately replicates the original assembly code, hinting at specialized training in assembly language and making it a proficient obfuscator across all techniques. In contrast, CodeGemma and CodeLLAMA show inadequate results across the three obfuscation techniques, primarily due to their training in high-level programming languages such as C/C++, Java, and Python rather than assembly. This leads to significant inaccuracies and irrelevant outputs. StarCoder, while capable of generating assembly code, demonstrates high variability in entropy, suggesting that it understands assembly but fails to consistently obfuscate at this level. Across the specialized LLMs, control flow change is the most effective obfuscation technique, followed by dead code insertion. Register substitution appears weaker, with a higher susceptibility to de-obfuscation. CodeT5, despite being fine-tuned, produces a high cosine similarity in register substitution, around 0.98, indicating a strong resemblance to the original code. Yet, its low entropy suggests minimal to no actual obfuscation, often merely reproducing the original code as obfuscated.

LLMs                 0-Shot        1-Shot        3-Shot        5-Shot        10-Shot       15-Shot
                     ∆(%)   CS     ∆(%)   CS     ∆(%)   CS     ∆(%)   CS     ∆(%)   CS     ∆(%)   CS
GPT-4o-mini          26.90  0.93   21.00  0.95   17.50  0.95   20.70  0.95   19.30  0.96   22.33  0.95
GPT-3.5              10.22  0.93   17.34  0.90    3.82  0.80    0.98  0.83    5.74  0.77    2.88  0.77
DeepSeekCoder-v2     19.50  0.99   22.00  0.99   25.50  0.99   26.40  0.99   27.00  0.99   27.50  0.99
Codestral            30.25  0.95   15.47  0.96   14.53  0.97   13.67  0.97   12.10  0.98   11.93  0.98
Starcoder            61.35  0.68   45.55  0.97   56.25  0.97   56.40  0.97   58.70  0.97   57.28  0.97
CodeGemma             2.48  0.30    2.31  0.31    1.60  0.38    1.40  0.40    1.51  0.40    1.60  0.40
CodeLlama             2.20  0.33    2.04  0.32    1.57  0.37    1.39  0.39    1.46  0.41    1.55  0.41
LLama3.1              0.02  0.51    2.36  0.39    0.11  0.90    0.06  0.91    N/A   N/A     N/A   N/A
Trained on MAD
CodeT5                0.06  0.97    -      -      -      -      -      -      -      -      -      -

Table 3: Experimental results of the baseline models on the Dead Code Insertion obfuscation. As we can observe, the ∆ entropy for dead code insertion ranges from 10% to 20% for the top four models, due to the dead code inserted into the original code. A cosine similarity between 0.9 and 0.98 represents this technique's expected level of obfuscation. "N/A": the LLM stopped generating assembly code. "-": the model cannot be prompted with a few-shot template.

Table 4: Experimental results of the baseline models on the Register Substitution obfuscation. As we can observe, the ∆ entropy for register substitution is very low in the top three models because only register names are changed. The CS indicates that, in general, the similarity between the two code snippets is very high because of the swapped register names.

Table 5: Experimental results of the baseline models on the Control Flow Change obfuscation. As we can observe, the ∆ entropy for control flow change is high in the top three models due to the insertion of a couple of JMP instructions and section labels. We also observe cosine similarity in the range 0.91 to 0.94, which shows 6% to 9% code obfuscation.

Obfuscation    GPT-4o-Mini  GPT-3.5  Starcoder  CodeLlama  CodeGemma  CodeT5  Codestral  DeepSeekCoder-v2
Dead Code      1.67         3.00     6.00       7.67       8.67       7.33    4.67       1.33
Register       1.00         4.33     6.00       8.67       8.33       6.67    3.67       2.00
Control Flow   1.67         4.33     7.00       8.33       8.33       5.67    3.67       1.33

Table 6: Human evaluation of various LLMs on the MAD. GPT-4o-mini and DeepSeekCoder-v2 were identified as the top two LLMs. There was ambiguity among evaluators about which model to rank as the third best, due to a tie between Codestral and GPT-3.5. Llama 3.1 was excluded from consideration due to its significantly low obfuscation rate and its inability to generate obfuscated assembly code.
Human Evaluation: We designed three criteria to assess the effectiveness of the obfuscation techniques in the various code-generating language models, each rated on a scale from 1 to 8. The criteria are: (a) ranking the eight outputs based on the insertion of ineffective code, (b) ranking based on the substitution of registers, and (c) ranking based on the rearrangement of code sequences. We chose 200 random assembly code samples for this evaluation, which was conducted by three experts specializing in malware analysis. The results are in Table 6, where a lower score signifies higher-quality obfuscation.

Related Work

Numerous classical software obfuscation techniques have been developed to safeguard against software tampering and reverse engineering, thereby preventing unauthorized access to source code (Nagra and Collberg 2009; Hosseinzadeh et al. 2018; Xu et al. 2020; Ahire and Abraham 2020). Tools such as LOKI and OBFUS reflect practical applications of these methodologies (Schloegel et al. 2022; Kang et al. 2021). The LLVM (Low-Level Virtual Machine) framework is particularly notable for its flexibility and extensibility in both obfuscation and de-obfuscation, commonly employing techniques like control flow alteration and dead code insertion (Junod et al. 2015; Garba and Favaro 2019). This study extends existing research by examining the potential of LLMs to develop obfuscation engines (Gupta et al. 2023). While much of the existing research in this field concentrates on detection and defense, this effort focuses on utilizing LLMs trained in high-level programming languages, which are traditionally easier for experts to manage and understand (Muennighoff et al. 2023). However, training these models presents significant challenges due to the syntactic diversity and complexity of programming paradigms, requiring substantial resources (Hou et al. 2023). Our dataset and trained models can test the robustness and reliability of traditional and LLM-based detection systems. MAD also enables studying other challenges in malware dataset construction, such as lack of diversity, data augmentation, and availability (Saul et al. 2024; Liu et al. 2024; Joyce et al. 2023b,a; Patel et al. 2023), but these are beyond the scope of this article.

Our method leverages a heuristic approach to the large-scale evaluation of code/malware. While provably correct equivalence is preferable, it is not tenable at this scale of work. Prior work has considered modifying the raw assembly of a program at high computational cost by leveraging domain-knowledge re-writers to maintain semantics-preserving changes, but notes that this may not correctly handle self-referential code (e.g., a checksum used to branch) (Wong et al. 2023). This cost increases in our setting of LLM-based code changes because we cannot leverage any domain expert system to accelerate an equivalency check (Lim and Nagarakatte 2019). Since general code equivalence is NP-hard, such domain knowledge is required and is known to be sound, but not complete (Tristan, Govereau, and Morrisett 2011).

Conclusion

In this work, we provide MAD, a dataset for assembly code obfuscation aimed at prompting and in-context learning of LLMs. Our dataset obfuscates assembly code snippets by applying three major obfuscation techniques: a) inserting dead instructions, b) register substitution, and c) changing control flow. To demonstrate the trainability and reliability of our dataset, we tested it by pre-training and prompting several well-known models, such as the GPT family, CodeLLAMA, CodeGemma, Starcoder, Codestral, and DeepSeekCoder-v2. We also fine-tuned CodeT5 on our dataset, leveraging its open-source nature and transparent, white-box architecture. To measure the performance of the models, we used cosine similarity and Shannon entropy to quantify the level of obfuscation between the original code and the code generated by the models. As shown in this paper, surprisingly, the GPT family (which is not a specialized coder LLM) shows outstanding performance for obfuscating assembly code, surpassing even specialized coder LLMs such as DeepSeekCoder-v2, Codestral, CodeLLAMA, CodeGemma, and Starcoder. The experiments demonstrated that even pre-trained models show high performance on the obfuscation task, but this does not necessarily lead to high grounding performance, and GPT remains dominant.

Acknowledgements

We acknowledge the support of the UMBC Cybersecurity Leadership – Exploratory Grant Program. Any opinions, conclusions, or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of UMBC or Booz Allen Hamilton.
References

Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F. L.; Almeida, D.; Altenschmidt, J.; Altman, S.; Anadkat, S.; et al. 2023. GPT-4 technical report. arXiv preprint arXiv:2303.08774.

Ahire, P.; and Abraham, J. 2020. Mechanisms for source code obfuscation in C: novel techniques and implementation. In 2020 International Conference on Emerging Smart Computing and Informatics (ESCI), 52–59. IEEE.

Balakrishnan, A.; and Schulze, C. 2005. Code obfuscation literature survey. CS701 Construction of Compilers, 19: 31.

Biggio, B.; Rieck, K.; Ariu, D.; Wressnegger, C.; Corona, I.; Giacinto, G.; and Roli, F. 2014. Poisoning behavioral malware clustering. In Proceedings of the 2014 Workshop on Artificial Intelligent and Security Workshop, AISec '14, 27–36. New York, NY, USA: Association for Computing Machinery. ISBN 9781450331531.

Biggio, B.; and Roli, F. 2018. Wild Patterns: Ten Years After the Rise of Adversarial Machine Learning. In Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security, CCS '18, 2154–2156. New York, NY, USA: Association for Computing Machinery. ISBN 9781450356930.

Demetrio, L.; Coull, S. E.; Biggio, B.; Lagorio, G.; Armando, A.; and Roli, F. 2021. Adversarial EXEmples: A Survey and Experimental Evaluation of Practical Attacks on Machine Learning for Windows Malware Detection. ACM Trans. Priv. Secur., 24(4).

Demontis, A.; Melis, M.; Biggio, B.; Maiorca, D.; Arp, D.; Rieck, K.; Corona, I.; Giacinto, G.; and Roli, F. 2019. Yes, Machine Learning Can Be More Secure! A Case Study on Android Malware Detection. IEEE Transactions on Dependable and Secure Computing, 16(4): 711–724.

Fleshman, W.; Raff, E.; Zak, R.; McLean, M.; and Nicholas, C. 2018. Static Malware Detection & Subterfuge: Quantifying the Robustness of Machine Learning and Current Anti-Virus. In 2018 13th International Conference on Malicious and Unwanted Software (MALWARE), 1–10. IEEE. Best Paper Award.

Garba, P.; and Favaro, M. 2019. SATURN - Software Deobfuscation Framework Based on LLVM. In Proceedings of the 3rd ACM Workshop on Software Protection, 27–38.

Gupta, M.; Akiri, C.; Aryal, K.; Parker, E.; and Praharaj, L. 2023. From ChatGPT to ThreatGPT: Impact of Generative AI in Cybersecurity and Privacy. IEEE Access.

Hosseinzadeh, S.; Rauti, S.; Laurén, S.; Mäkelä, J.-M.; Holvitie, J.; Hyrynsalmi, S.; and Leppänen, V. 2018. Diversification and obfuscation techniques for software security: A systematic literature review. Information and Software Technology, 104: 72–93.

Hou, X.; Zhao, Y.; Liu, Y.; Yang, Z.; Wang, K.; Li, L.; Luo, X.; Lo, D.; Grundy, J.; and Wang, H. 2023. Large language models for software engineering: A systematic literature review. arXiv preprint arXiv:2308.10620.

Joyce, R. J.; Amlani, D.; Nicholas, C.; and Raff, E. 2023a. MOTIF: A Malware Reference Dataset with Ground Truth Family Labels. Computers & Security, 124: 102921.

Joyce, R. J.; Raff, E.; Nicholas, C.; and Holt, J. 2023b. MalDICT: Benchmark Datasets on Malware Behaviors, Platforms, Exploitation, and Packers. Proceedings of the Conference on Applied Machine Learning in Information Security.

Junod, P.; Rinaldini, J.; Wehrli, J.; and Michielin, J. 2015. Obfuscator-LLVM – Software Protection for the Masses. In 2015 IEEE/ACM 1st International Workshop on Software Protection, 3–9. IEEE.

Kang, S.; Lee, S.; Kim, Y.; Mok, S.-K.; and Cho, E.-S. 2021. OBFUS: An Obfuscation Tool for Software Copyright and Vulnerability Protection. In Proceedings of the Eleventh ACM Conference on Data and Application Security and Privacy, 309–311.

Kolosnjaji, B.; Demontis, A.; Biggio, B.; Maiorca, D.; Giacinto, G.; Eckert, C.; and Roli, F. 2018. Adversarial Malware Binaries: Evading Deep Learning for Malware Detection in Executables. In 2018 26th European Signal Processing Conference (EUSIPCO), 533–537.

Lehmann, D.; Kinder, J.; and Pradel, M. 2020. Everything old is new again: Binary security of WebAssembly. In 29th USENIX Security Symposium (USENIX Security 20), 217–234.

Lim, J. P.; and Nagarakatte, S. 2019. Automatic Equivalence Checking for Assembly Implementations of Cryptography Libraries. In 2019 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), 37–49.

Linn, C.; and Debray, S. 2003. Obfuscation of executable code to improve resistance to static disassembly. In Proceedings of the 10th ACM Conference on Computer and Communications Security, 290–299.

Liu, C.; Saul, R.; Sun, Y.; Raff, E.; Fuchs, M.; Pantano, T. S.; Holt, J.; and Micinski, K. 2024. Assemblage: Automatic Binary Dataset Construction for Machine Learning. In The Thirty-eighth Conference on Neural Information Processing Systems Datasets and Benchmarks Track.

MetaAI. 2024. Introducing Llama 3.1: Our most capable models to date. https://ai.meta.com/blog/meta-llama-3-1/. Accessed: 2024-08-16.

MistralAI. 2024. Codestral: Hello, World! https://mistral.ai/news/codestral/. Accessed: 2024-08-16.

Montolio, A. G. 2023. A gentle introduction to SMT-based program analysis. Fura Labs. https://furalabs.com/blog/2023/02/12/intro_to_smt_analysis. Accessed: 2024-12-20.

Muennighoff, N.; Liu, Q.; Zebaze, A.; Zheng, Q.; Hui, B.; Zhuo, T. Y.; Singh, S.; Tang, X.; Von Werra, L.; and Longpre, S. 2023. OctoPack: Instruction tuning code large language models. arXiv preprint arXiv:2308.07124.

Na, C.; Choi, Y.; and Lee, J.-H. 2023. DIP: Dead code insertion based black-box attack for programming language model. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 7777–7791.

Nagra, J.; and Collberg, C. 2009. Surreptitious Software: Obfuscation, Watermarking, and Tamperproofing for Software Protection. Pearson Education.

OpenAI. 2024. GPT-4o. https://platform.openai.com/docs/models/gpt-4o. Accessed: 2024-08-16.

Patel, T.; Lu, F.; Raff, E.; Nicholas, C.; Matuszek, C.; and Holt, J. 2023. Small Effect Sizes in Malware Detection? Make Harder Train/Test Splits! Proceedings of the Conference on Applied Machine Learning in Information Security.

Roziere, B.; Gehring, J.; Gloeckle, F.; Sootla, S.; Gat, I.; Tan, X. E.; Adi, Y.; Liu, J.; Remez, T.; Rapin, J.; et al. 2023. Code Llama: Open foundation models for code. arXiv preprint arXiv:2308.12950.

Saul, R.; Liu, C.; Fleischmann, N.; Zak, R. J.; Micinski, K.; Raff, E.; and Holt, J. 2024. Is Function Similarity Over-Engineered? Building a Benchmark. In The Thirty-eighth Conference on Neural Information Processing Systems Datasets and Benchmarks Track.

Schloegel, M.; Blazytko, T.; Contag, M.; Aschermann, C.; Basler, J.; Holz, T.; and Abbasi, A. 2022. Loki: Hardening code obfuscation against automated attacks. In 31st USENIX Security Symposium (USENIX Security 22), 3055–3073.

Team, C. 2024. CodeGemma: Open code models based on Gemma. arXiv preprint arXiv:2406.11409.

Tristan, J.-B.; Govereau, P.; and Morrisett, G. 2011. Evaluating value-graph translation validation for LLVM. In Proceedings of the 32nd ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI '11, 295–305. New York, NY, USA: Association for Computing Machinery. ISBN 9781450306638.

Wang, Y.; Wang, W.; Joty, S.; and Hoi, S. C. 2021. CodeT5: Identifier-aware unified pre-trained encoder-decoder models for code understanding and generation. arXiv preprint arXiv:2109.00859.

Wong, M.; Raff, E.; Holt, J.; and Netravali, R. 2023. Marvolo: Programmatic Data Augmentation for Deep Malware Detection. In Machine Learning and Knowledge Discovery in Databases: Research Track: European Conference, ECML PKDD 2023, Turin, Italy, September 18–22, 2023, Proceedings, Part I, 270–285. Berlin, Heidelberg: Springer-Verlag. ISBN 978-3-031-43411-2.

Xu, H.; Zhou, Y.; Ming, J.; and Lyu, M. 2020. Layered obfuscation: a taxonomy of software obfuscation techniques for layered security. Cybersecurity, 3: 1–18.

Yang, A. Z.; Kolak, S.; Hellendoorn, V. J.; Martins, R.; and Goues, C. L. 2024. Revisiting Unnaturalness for Automated Program Repair in the Era of Large Language Models. arXiv preprint arXiv:2404.15236.

Zhu, Q.; Guo, D.; Shao, Z.; Yang, D.; Wang, P.; Xu, R.; Wu, Y.; Li, Y.; Gao, H.; Ma, S.; et al. 2024. DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence. arXiv preprint arXiv:2406.11931.