Locus

The document presents Locus, a novel framework designed to enhance directed fuzzing by synthesizing semantically meaningful predicates that capture fuzzing progress towards target program states. Locus automates the generation of these predicates using an agentic framework, significantly improving the efficiency of fuzzing processes, achieving an average speedup of 41.6× across various state-of-the-art fuzzers. The framework has successfully identified previously unpatched vulnerabilities, demonstrating its effectiveness in real-world applications.

Locus: Agentic Predicate Synthesis for Directed Fuzzing

Jie Zhu, University of Chicago, Chicago, USA, jiezhu@uchicago.edu
Chihao Shen, University of Maryland, College Park, USA, stevencs@umd.edu
Ziyang Li, Johns Hopkins University, Baltimore, USA, ziyang@jhu.edu
Jiahao Yu, Northwestern University, Evanston, USA, jiahao.yu@northwestern.edu
Yizheng Chen, University of Maryland, College Park, USA, yzchen@umd.edu
Kexin Pei, University of Chicago, Chicago, USA, kpei@cs.uchicago.edu
arXiv:2508.21302v2 [cs.CR] 3 Sep 2025

Abstract

Directed fuzzing aims to find program inputs that lead to specified target program states. It has broad applications, such as debugging system crashes, confirming reported bugs, and generating exploits for potential vulnerabilities. This task is inherently challenging because target states are often deeply nested in the program, while the search space manifested by numerous possible program inputs is prohibitively large. Existing approaches rely on branch distances or manually specified constraints to guide the search; however, the branches alone are often insufficient to precisely characterize progress toward reaching the target states, while the manually specified constraints are often tailored for specific bug types and thus difficult to generalize to diverse target states and programs. We present Locus, a novel framework to improve the efficiency of directed fuzzing. Our key insight is to synthesize predicates to capture fuzzing progress as semantically meaningful intermediate states, serving as milestones towards reaching the target states. When used to instrument the program under fuzzing, they can reject executions unlikely to reach the target states, while providing additional coverage guidance. To automate this task and generalize to diverse programs, Locus features an agentic framework with program analysis tools to synthesize and iteratively refine the candidate predicates, while ensuring the predicates strictly relax the target states to prevent false rejections via symbolic execution. Our evaluation shows that Locus substantially improves the efficiency of eight state-of-the-art fuzzers in discovering real-world vulnerabilities, achieving an average speedup of 41.6×. So far, Locus has found eight previously unpatched bugs, with one already acknowledged with a draft patch.

1 Introduction

Directed Greybox Fuzzing (DGF) aims to search for program inputs leading its execution to reach specific target program states, e.g., an index variable used to access an array is out of the array bounds, thereby uncovering potential bugs or vulnerabilities. It is widely used in software engineering and security applications, including debugging system crashes [81, 85], testing patches [4, 60], verifying bug reports from static analysis [15, 63], and generating Proof-of-Concept (PoC) exploits of vulnerabilities [5, 52]. While the original purpose of directed fuzzing is largely not for discovering new bugs, i.e., it needs a specified target, it has found numerous impactful zero-day vulnerabilities [8, 49, 64, 74, 86].

DGF is challenging, as the target program states are often deeply nested in the program, while the search space introduced by the complexity of real-world software is prohibitively large. To speed up the search and schedule promising inputs, most existing techniques leverage the control flow proximity, e.g., the distance to the target location in the control flow graph, or heuristics based on the semantics of the branch predicates [12, 47, 74], e.g., the target state if x==42 has the distance metric of $|x - 42|$. However, such feedback is sometimes too sparse or indirect to reliably measure the progress, especially when there is a long chain of implicit preconditions guarding the target program states [1, 25, 33, 40, 41, 62, 69, 76, 100]. For example, triggering CVE-2018-13785 in libpng requires a PNG file to satisfy a precise sequence of preconditions, i.e., valid signature, correct chunk layout, specific IHDR fields (e.g., bit depth, color type), and a magic image width (0x55555555), to trigger an integer overflow [40], while the predicates to explicitly check these conditions are largely absent in the code to provide an incremental progress guidance.

To capture the intricate feedback to improve search efficiency, more advanced approaches identify progress-capturing constraints in the program to drive execution towards satisfying specific temporal orders and preconditions [1, 3, 26, 33, 38, 43, 47, 49, 61, 65]. However, these constraints are often manually crafted by experts and tailored to a few specific target state types, e.g., focusing on a temporal memory safety bug like a use-after-free by enforcing an allocate–free–use sequence. As different programs can have diverse target states and disparate functionalities, the feedback metric effective in one case may not generalize to another [14].

As Machine Learning (ML) and Large Language Models (LLMs) have demonstrated surprising code reasoning capabilities, there has been a growing interest in extending such capabilities to help guide fuzzing [66]. A common approach is to employ LLMs to directly generate inputs or grammar-aware input generators (i.e., fuzz driver or harness) [10, 37, 52, 53, 58, 77, 96, 101], where the preconditions useful to reach target states are expressed as part of the input grammar constraints. However, not all conditions to reach target states can be easily represented as input grammars. For example, Figure 1 shows that an intermediate state (!found_plte) necessary to trigger the buffer overflow in libpng only emerges in the middle of the execution and cannot be easily checked at the input level. More importantly, such a task formulation is particularly challenging for LLMs, as reasoning from the target program states all the way to the input often requires an extremely long context and convoluted reasoning chain [21, 39, 44, 45, 48, 70, 71], a common
pitfall for LLM hallucination [72, 98], let alone the challenge to verify the correctness of the LLM-generated inputs and harnesses, which can significantly impede the fuzzing progress if the generated constraints are incorrect [38, 65].

Our approach. We present Locus, a new framework that integrates LLMs' code reasoning capabilities for directed fuzzing by synthesizing semantically meaningful and verifiable predicates to guide the search. Unlike existing LLM-based approaches that focus on constraining the search space at the input (harness) level, which poses a high reasoning burden on LLMs and can be hard to verify, Locus generalizes the constraint generation to be at arbitrary program points. Specifically, given a target state to reach, Locus automates the reasoning about the intermediate program states that verifiably relax the target states and synthesizes predicates as a curriculum to capture the gradual progress towards reaching them. Such predicates serve as the preconditions dominating all executions to reach the target states, providing fine-grained progress feedback to guide the fuzzers for input scheduling and early termination [35, 56, 78]. As these predicates are implemented as source-level instrumentation, they are agnostic to any fuzzer implementations and can thus be integrated without any customization. Moreover, as the instrumentation is a one-time offline process, the cost of running Locus can be amortized in all the succeeding fuzzing procedures.

Agentic design. Automating the generation of effective progress-capturing predicates for diverse target states involves nontrivial reasoning across the entire codebase, spanning multiple functions and their associated data and control flows. Simple prompting techniques, such as in-context learning, chain-of-thought (CoT) prompting, retrieval-augmented generation, etc., can hardly support sophisticated reasoning at the repository level. To this end, Locus features an agentic synthesizer-validator workflow, equipped with diverse program analysis tools to support the traversal of the control flow graph, tracking data dependencies, retrieving function calls, and symbolic execution. They serve as the agent's action space, allowing the LLM to reason via CoT and act by calling them [97] to iteratively propose, refine, and validate candidate predicates.

Importantly, by constraining the output space of the agent at the predicate level, Locus ensures any (inevitable) LLM errors are corrected before deployment for fuzzing. Specifically, Locus's output is validated by both the compiler and symbolic execution [7]. The former checks the syntactic correctness, while the latter ensures the generated predicates strictly relax the target states, i.e., by checking whether there exists a path that violates the predicates while satisfying the target states. Such a strict relaxation ensures that the fuzzing execution can be safely terminated early if it violates the predicate. Figure 2 shows Locus's workflow.

Results. We evaluate Locus on the Magma benchmark [32] with eight widely used libraries and 10 vulnerability types across eight state-of-the-art fuzzers, covering both directed and undirected ones. Locus achieves a significant 70.3× speedup on average for directed fuzzers, with up to 214.2× speedup when integrated to accelerate SelectFuzz [56], one of the state-of-the-art directed fuzzers. For coverage-guided fuzzers, Locus accelerates them by 13× on average, including a 15.3× speedup for an extensively optimized fuzzer like AFL++. So far, Locus has found eight previously unpatched bugs. We have responsibly reported all the bugs, and one of the maintainers has already acknowledged one with a drafted patch.

2 Overview

In this section, we first describe the background of directed fuzzing. We then contrast our idea with the existing approaches to demonstrate how Locus complements the existing design (Figure 1).

2.1 Directed Fuzzing

Canary. Directed fuzzing aims to generate inputs that drive the program execution to reach predefined program states. In this paper, we rely on the concept of canaries [32] to explicitly represent the target states, which are considered reached when the corresponding canary condition is satisfied. Early works like AFLGo [6] and Beacon [35] treat canaries primarily as the reachability to particular program points, i.e., specific line numbers in a code file. Recent works [32, 88] adopt assertion-based canaries, where predicates are explicitly inserted to check vulnerability-triggering conditions at runtime. An example of a canary is highlighted in the bottom-left code snippet of Figure 1.

Canaries for directed fuzzing originate from various sources, including static analysis alerts, manually identified vulnerability sites, or runtime sanitizers. For example, address sanitizers [73] can also be approximately viewed as canaries checking memory safety violations, e.g., inserting canary(index > maxbound) to detect out-of-bound accesses.

Common strategies to reach canaries. Previous research on directed fuzzing can be broadly grouped into the following main strategies:

• Distance-guided scheduling. This approach approximates the distance from the currently covered regions of the CFG to the canary, and then prioritizes seeds that are more likely to drive execution along paths that reduce this distance [6, 11, 24]. For paths that are unlikely to reach the canary, the execution can be terminated early [35, 56, 78].
• Specialized progress-capturing state representations. Some approaches introduce domain-specific abstractions tailored for certain classes of vulnerabilities [43, 61]. For example, tracking dataflow from memory allocation to deallocation events for the use-after-free vulnerability.
• LLM-assisted harness generation. A more recent direction leverages LLMs to synthesize input generators to manipulate raw program inputs [54, 90, 96, 102]. These methods attempt to constrain the valid execution at the input level by leveraging the learned input grammar knowledge in the LLMs.

2.2 Motivating Example

We use a real-world vulnerability CVE-2013-6954 in libpng, a widely used C library for parsing PNG files, to demonstrate how Locus complements existing approaches.

A simplified CFG of this vulnerability is shown in Figure 1a, where all the nodes highlighted in gray are equidistant to the canary node (highlighted in yellow), but only one node PLTE is in the path that can lead to the vulnerability (annotated as green arrows). Based on this, fuzzers can only perform conservative pruning (annotated as dashed arrows) while omitting a large part of the
[Figure 1 image omitted. Panels: (a) A simplified CFG; (b) LLM-generated code; (c) Locus synthesizes predicates in relevant functions. Legend: control flow edges; edges pruned by existing DGF; path to trigger the canary; nodes equidistant to canary; canary node. Panel (c) shows the chunk-parsing loop of png_read_info with the synthesized check if (!found_plte) EXIT(); and the canary CANARY(num > PNG_MAX_PALETTE_LENGTH); at png_set_PLTE.]

Figure 1: A motivating example (CVE-2013-6954) showing how Locus complements existing works. (a) Traditional approaches
based on distance to targets in CFG lack fine-grained guidance to distinguish nodes when they have the same distance. (b)
LLM-based harness generation is limited to help reach the target. (c) Predicates (as if statements) synthesized by Locus provide
extra semantic guidance for DGFs, while relaxing the constraint generation from input-level to arbitrary program points.
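
The kind of predicate shown in panel (c) can be sketched on a toy chunk parser. This is a minimal illustration, not libpng code: the chunk type, the `parse_chunks` function, and the result codes are all hypothetical, and the early exit is modeled as a return value instead of a call to EXIT().

```c
#include <assert.h>
#include <stddef.h>

/* Toy chunk kinds: hypothetical stand-ins, not libpng's real definitions. */
typedef enum { CHUNK_IHDR, CHUNK_PLTE, CHUNK_IDAT, CHUNK_IEND } chunk_kind;

typedef struct {
    chunk_kind kind;
    unsigned   length;   /* for PLTE: number of palette entries */
} chunk;

typedef enum {
    RUN_REJECTED,    /* a synthesized predicate failed: terminate early */
    RUN_NO_TARGET,   /* parsed to the end without reaching the target   */
    RUN_AT_TARGET    /* the toy canary condition holds                  */
} run_result;

/* Toy model of a chunk-parsing loop instrumented as in Figure 1c. */
run_result parse_chunks(const chunk *chunks, size_t n, unsigned max_palette)
{
    assert(chunks != NULL || n == 0);

    int found_plte = 0;          /* state added by the instrumentation */
    unsigned palette_len = 0;

    for (size_t i = 0; i < n; i++) {
        switch (chunks[i].kind) {
        case CHUNK_PLTE:
            found_plte = 1;
            palette_len = chunks[i].length;
            break;
        case CHUNK_IDAT:
            /* Predicate inside the IDAT branch: PLTE must come before
               IDAT, so a run without PLTE here cannot reach the target. */
            if (!found_plte)
                return RUN_REJECTED;
            break;
        default:
            break;
        }
    }

    /* Same predicate right after the parsing loop. */
    if (!found_plte)
        return RUN_REJECTED;

    /* Toy canary: the palette is larger than the allowed maximum. */
    return palette_len > max_palette ? RUN_AT_TARGET : RUN_NO_TARGET;
}
```

In this sketch an input without a PLTE chunk is rejected as soon as the IDAT branch is entered, while inputs that do carry a PLTE chunk continue to the canary check unchanged, which is the behavior the figure attributes to the synthesized predicate.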

paths that are irrelevant to the vulnerability, e.g., all solid black arrows. Capturing the progress towards reaching this canary requires specialized knowledge about libpng parsing logic. However, developing such specialized progress-capturing state representations requires manual efforts from experts and cannot easily generalize to different vulnerabilities and programs.

Figure 1b shows an example of the input generator synthesized by LLM-based approaches that enforce some specific input constraints. However, this approach is inherently limited as program inputs can have sophisticated structures that cannot be easily enforced at the input level, e.g., the PNG file contains a compressed data chunk IDAT. Therefore, the constraints posed on the input generated by the LLM often reduce to generic input grammars, e.g., png_sig_cmp only trivially ensures the input is a valid PNG file. We also note that the synthesized generators cannot be easily checked to determine whether they can effectively help reach the canary. For example, png_write_frame_head constrains the generated input to a special PNG type, APNG, from which it is essentially impossible to reach the canary, but automatically checking this fact is challenging. Moreover, to constrain the input to satisfy certain properties necessary to reach the target, the generator sometimes needs to repeat the input processing logic presented in the execution path. Such repeated execution introduces additional overhead.

Locus synthesizes progress-capturing predicates at arbitrary program points to provide more fine-grained guidance and complement the above approaches. Specifically, Locus generates a predicate in png_read_info (shown in Figure 1c), representing the precondition to trigger a real-world buffer overflow vulnerability and terminating the execution if it is not satisfied (if(!found_plte)). The target state is highlighted in a canary statement. Locus's trajectory reveals that it relies on the semantic reasoning of libpng's parsing behaviors for generating this semantically meaningful predicate.

Specifically, PNG files consist of a series of structured chunks, many of which are optional and can appear in varying orders. One such optional chunk, PLTE, stores the palette data used for indexed-color images and must adhere to strict size constraints. In this example, an input PNG file can trigger a buffer overflow in the png_set_PLTE function only when the PNG file contains a PLTE chunk, which serves as the necessary state before the target state can be reached, i.e., the palette size exceeds the expected bounds.

Existing fuzzers struggle with such a vulnerability due to the complex parsing logic. Since PNG chunks may appear in arbitrary order and many are optional, the primary parsing routine implements a loop in png_read_info to iteratively process each chunk. Within this loop, multiple branches exist, and each is responsible for handling a specific chunk type, e.g., IHDR, IDAT, PLTE. This loop makes it possible to identify whether a given PNG file contains a PLTE chunk. However, since these branches are syntactically parallel and executed without a fixed order, they all appear equidistant from the vulnerability site (png_set_PLTE) in the control flow graph. As a result, the naive path distances cannot distinguish between them to prioritize one path over the other. This explains why, in Table 2 and Table 3, all existing fuzzers take a substantial amount of time to reach this target state in libpng.

Locus starts with generating a predicate to check whether the input PNG file contains a PLTE chunk at the caller of the canary function (①). This implies that if the PNG file does not contain a PLTE chunk, the execution can terminate, as it is impossible to reach the target state. While this predicate is semantically correct, i.e., the symbolic execution confirms there is no feasible path that satisfies the target state while violating the generated predicate (Section 3.4), and can help filter out non-palette-based PNG files, it is almost redundant as there is an existing check immediately after the generated one (if (color_type & MASK_PALETTE)), and thus the generated predicate could barely help the fuzzer.

To address this issue, Locus includes another refinement iteration (Algorithm 1) to propagate this predicate to a location closer to the program entry such that the infeasible execution can be terminated earlier. By traversing the call graph and retrieving the functions along the call chain, the refinement proposes a new location in function png_read_info right after the parsing loop (②). Such a propagated predicate will also be validated again to ensure it remains a strict precondition to reach the target states. At the next iteration of refinement, Locus proposes to propagate this predicate further inside the branch that parses the IDAT chunk (③). That is because the specification of the PNG file requires that the optional
[Figure 2 image omitted. Stages: Synthesis (Localization, Generation) and Validation (Syntax Validation, Semantic Validation), followed by the Fuzzer. Panel labels: Canary Reasoning; Localized Functions; Localized Program Point; Generated Predicate; Build Report; Control Flow Graph; Pruned Control Flow Graph; Predicates Validation; Crashing Input. The illustrated example uses the canary a > b + abs(c) and the generated predicate a > b.]

Figure 2: Overview of Locus workflow. Locus takes as inputs the program codebase $P$ and the canary $\psi$, and produces a program $P'$ instrumented with the progress-capturing predicates. The predicate branches provide extra coverage feedback and guards (via early termination) to guide the fuzzer toward reaching the target state, i.e., canary $\psi$, more efficiently.
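
The semantic validation step can be illustrated on the toy example from Figure 2, with the canary a > b + abs(c) and the candidate predicate a > b. The sketch below is an assumption-laden stand-in: it searches a small bounded state space by brute force, whereas the paper validates with symbolic execution. The `state` type, the function names, and the deliberately wrong candidate `phi_too_strict` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* A toy program state with three integer variables. */
typedef struct { int a, b, c; } state;
typedef bool (*predicate_fn)(const state *);

/* Canary psi from the toy example: a > b + abs(c). */
bool canary(const state *s)         { return s->a > s->b + abs(s->c); }
/* Candidate phi: a > b. Implied by the canary because abs(c) >= 0. */
bool phi_relaxed(const state *s)    { return s->a > s->b; }
/* Hypothetical bad candidate: a > b + 10 rejects some canary states. */
bool phi_too_strict(const state *s) { return s->a > s->b + 10; }

/* Bounded search for a state where the canary holds but phi fails.
   Returns true and stores the state in *out if one exists in
   [-bound, bound]^3; such a state shows phi is not a relaxation. */
bool find_counterexample(predicate_fn phi, predicate_fn psi,
                         int bound, state *out)
{
    assert(bound >= 0 && out != NULL);
    for (int a = -bound; a <= bound; a++)
        for (int b = -bound; b <= bound; b++)
            for (int c = -bound; c <= bound; c++) {
                state s = { a, b, c };
                if (psi(&s) && !phi(&s)) {
                    *out = s;
                    return true;
                }
            }
    return false;
}
```

A candidate for which `find_counterexample` succeeds would reject an execution that can still reach the canary, so it must be regenerated. Finding nothing within the bound is only evidence, not proof, which is one reason a symbolic checker is the stronger choice.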

PLTE chunk must appear before the IDAT chunk. Therefore, by the time the parsing procedure reaches the IDAT chunk, we can already determine whether the input PNG file contains a PLTE chunk.

Such a finalized predicate benefits all fuzzers, as it can prioritize inputs with PLTE properties early in the execution and save the fuzzers from considering PLTE-irrelevant PNG images. With such a predicate, AFLGo [6] gained an impressive 8× speedup in triggering this vulnerability with only a three-line change in png_read_info.

3 Methodology

Figure 2 illustrates the high-level workflow of Locus. Given a specified canary $\psi$ and the program codebase $P$, Locus outputs a new codebase $P'$ instrumented by a set of predicates $\Phi$, where each $\phi \in \Phi$ is represented by a branch condition with early exit if the predicate condition is not satisfied. The synthesizer is responsible for generating candidate predicates $\Phi$, while the validator ensures that the predicates in $\Phi$ are both syntactically valid and semantically consistent with $\psi$, i.e., relaxing the canary conditions. The fuzzer will run on $P'$ to receive more progress feedback to the canary while enjoying the early termination. In the following, we formalize the task and then elaborate on each design component in Locus.

3.1 Formalization

We use the notation $P \Downarrow_x S$ to indicate that the program $P$, when executed with input $x \in X$ sampled from $P$'s input space $X$, can reach a set of program states $S$. Assume triggering a vulnerability $v$ can be characterized as reaching a set of program states $S_v$. Given the program under fuzz ($P'$) instrumented by $\Phi$, we need to make sure that the early termination introduced in $\Phi$ preserves the same fuzzing behavior on $P'$ as that of $P$, i.e., the predicates in $\Phi$ do not reject any $x \in X$ that would have reached $S_v$ in $P$. To this end, we formally define fuzzing admissibility.

Definition 1. For a vulnerability $v$, the program $P'$ is fuzzing admissible to $P$, iff $\forall x \in X,\ P' \Downarrow_x S_v \implies P \Downarrow_x S_v$.

While it is unlikely to ensure $P'$ is fuzzing admissible to $P$ in general without a pre-defined target vulnerability $v$, we show such a property is well-defined when $v$ is given and explicitly represented by the canary $\psi$.

A program predicate $\phi : s \to \{\text{True}, \text{False}\}$ is a boolean mapping over the program state space. Concretely, any conditions inside the branch statements, e.g., if, while, or assert, can be regarded as predicates, as they evaluate a Boolean expression over the program state. We define a special class of predicates, namely canaries, to characterize vulnerable states:

Definition 2. A vulnerability canary $\psi$ is a predicate s.t. $\forall s,\ s \in S_v \Leftrightarrow \psi(s) = \text{True}$.

The goal of directed fuzzing towards $S_v$ is equivalent to finding inputs that satisfy the canary $\psi$. To provide semantically meaningful guides to directed fuzzing, we may instrument the program $P$ with an additional predicate $\phi$. Such instrumentation is admissible if and only if $\phi$ is a relaxation of the true vulnerability canary $\psi$:

Definition 3. A predicate $\phi$ is the relaxation of canary $\psi$, if $\forall s \in S_v,\ \psi(s) = \text{True} \implies \phi(s) = \text{True}$.

By definition, $P'$ instrumented by $\Phi$ is fuzzing admissible if every predicate $\phi$ is a relaxation of $\psi$. This suggests that fuzzing $P'$ is equivalent to fuzzing $P$ while enjoying the additional guidance and early termination introduced by the instrumented predicates.

Theorem 1. The instrumented program $P'$ is fuzzing admissible to $P$, if $P'$ is instrumented with $\Phi$, where every $\phi \in \Phi$ is the relaxation of $\psi$.

We next illustrate how our agentic synthesizer-validator workflow can produce an admissible instrumented program $P'$ which
Table 1: Toolset for synthesizer agent

| Category | API | Description | Output example | Stats (usage ratio) |
|---|---|---|---|---|
| Search | class(cls, [file]) | Search for class or structure cls in the codebase or file | typedef struct png_XYZ { ... } | 13% |
| Search | method(m, [file]) | Search for method or function m in the codebase or file | void png_read(png_ptr, ...) {...} | 17% |
| Search | symbol(s, [file]) | Search for symbol s in the codebase or file | TIFFDataType dtype = TIFF_BYTE; | 10% |
| Search | code(c, [file]) | Search for code snippet c in the codebase or file | raw2tiff.c:302:memcpy(buf, ...) | 21% |
| Graph | callers(f) | Get the caller names of function f in the codebase | [exif_iif_add_tag, exif_iif_add_fmt, ...] | 13% |
| Graph | callees(f) | Get the callee names of function f in the codebase | [php_strnlen, estrndup, php_ifd_double, ...] | 6% |
| Graph | references(s) | Get all references of the symbol s in the codebase | [misc.c:2670:if(sig_num!=SIGALRM){, ...] | 8% |
| Listing | files(dir) | List all file names under the given path dir | [caf.c, chunk.c, common.h, ...] | 11% |
| Listing | classes(file) | List all classes and structures defined in file | [SF_INFO, sf_private_tag, SF_VIRTUAL_IO, ...] | 5% |
| Listing | methods(file) | List all methods and functions defined in file | [exif_get_tag_ht, ifd_get32s, ptr_offset, ...] | 6% |
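
To make the Graph rows of Table 1 concrete, the following sketch shows one way `callers(f)` and `callees(f)` queries could be answered from a statically derived edge list. It is an illustration under assumptions, not the tool's implementation: the `call_edge` type, the function signatures, and the output-buffer convention are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One direct call edge in a statically derived call graph. */
typedef struct {
    const char *caller;
    const char *callee;
} call_edge;

/* Write up to cap callers of function f into out; return how many. */
size_t callers(const call_edge *g, size_t n_edges, const char *f,
               const char **out, size_t cap)
{
    assert(g != NULL && f != NULL && out != NULL);
    size_t k = 0;
    for (size_t i = 0; i < n_edges && k < cap; i++)
        if (strcmp(g[i].callee, f) == 0)
            out[k++] = g[i].caller;
    return k;
}

/* Write up to cap callees of function f into out; return how many. */
size_t callees(const call_edge *g, size_t n_edges, const char *f,
               const char **out, size_t cap)
{
    assert(g != NULL && f != NULL && out != NULL);
    size_t k = 0;
    for (size_t i = 0; i < n_edges && k < cap; i++)
        if (strcmp(g[i].caller, f) == 0)
            out[k++] = g[i].callee;
    return k;
}
```

An edge list like this only records direct calls, so a call made through a function pointer has no edge and is invisible to both queries.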

helps provide rich semantic feedback for the directed fuzzer to reach vulnerable states $S_v$.

3.2 Agent Toolset

Existing agentic practices for software analyses, such as bug fixing or fault localization, are often equipped with lightweight command-line tools to perform local reasoning [67, 95, 105]. This is effective because the root cause of a bug is often spatially close to the observable failure, allowing the agent to retrieve relevant context from a narrow portion of the entire codebase. However, in our task formulation, predicates can be synthesized at any program point. As a result, local reasoning is insufficient, and the agent needs additional tools to support analysis of a broader code context.

To equip the agent with long-context code reasoning at arbitrary program points, we provide a suite of tools that Locus can invoke to navigate the entire codebase and retrieve relevant code snippets. Particularly, in addition to common tools like code search and file listing, Locus integrates specialized tools for traversing program graphs, including call graphs and reference graphs. Through the call graph APIs, the synthesizer agent can retrieve function call relationships and reason about the interprocedural control flow, while the reference graph API allows it to identify variable usages, pointer dereferences, and data access patterns across the codebase. Table 1 shows the complete toolset and the ratio of their actual usage in our experiments. Among them, graph traversal accounts for over a quarter of all API invocations.

It is important to note that these graphs are only provided as supplementary references to Locus, as they are all derived statically and may miss program behaviors such as dynamic dispatches and indirect calls. We observe that the LLM used in Locus is capable of leveraging its learned knowledge to bridge this gap. For example, in the case of vulnerability TIF002 (see Table 2 and Table 3), Locus successfully resolved the indirect function pointer tif->tif_decoderow to its concrete implementation PixarLogDecode.

3.3 Synthesis

Algorithm 1 elaborates on the workflow of Locus. For each candidate predicate, it iteratively localizes and generates the predicates (lines 4–13), which are then refined and validated for both syntax and semantics (line 8). Once validated, the predicates will instrument the original program to produce a fuzzing-admissible program (line 16).

Algorithm 1 Scaffold of Locus's agentic workflow
Require: original program $P$, vulnerability canary $\psi$
Ensure: a target-conditional equivalent program $P'$
 1: $\mathcal{C} \leftarrow \text{CanaryReasoning}(P, \psi)$  ▷ list of reasoning tokens
 2: $\Phi \leftarrow \emptyset$
 3: for all $c \in \mathcal{C}$ do
 4:     $l \leftarrow \text{Localize}(c, P)$  ▷ find the initial program point
 5:     $n \leftarrow 0$
 6:     repeat
 7:         $\phi_l \leftarrow \text{Generate}(l, c, P)$
 8:         while $\neg\,\text{validate}(\phi_l, \psi, P)$ do  ▷ syntax and semantic
 9:             $\phi_l \leftarrow \text{Generate}(l, \phi, P)$
10:         end while
11:         $l \leftarrow \text{Localize}(\phi, l, c, P)$  ▷ refine, find a better location
12:         $n \leftarrow n + 1$
13:     until $l = \text{None} \lor n > \text{MaxIterations}$
14:     $\Phi \leftarrow \Phi \cup \{\phi_l\}$
15: end for
16: $P' \leftarrow \text{Instrument}(P, \Phi)$  ▷ fuzzing admissible program

As we demonstrate in Theorem 1, a predicate $\phi$ should be instrumented at the execution path that can reach the canary $\psi$. To synthesize $\phi$, LLMs must reason about the root cause of the vulnerability and approximate potential execution traces of the program that can trigger it. The execution trace of a program is often long and complex, involving multiple functions and files. Therefore, a naive one-shot synthesis approach may not be sufficient, as the synthesizer only retrieves limited context and proposes primitive predicates, e.g., an initial predicate generated by Locus (② in Figure 1c). To address this, Locus employs an iterative localization-generation refinement workflow. In each iteration, the synthesizer generates a predicate that preserves the same semantic meaning while moving its placement closer to the program entry.
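
The control flow of Algorithm 1 can be sketched as follows, with the agent's steps replaced by callbacks. This is a simplified reading rather than Locus's implementation, and it makes several assumptions that the paper does not state: program points are plain integers, validation retries are capped by a hypothetical `max_attempts`, and the last predicate that passed validation is the one kept. The `toy_*` callbacks exist only to exercise the loop.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NO_LOCATION (-1)

/* A candidate predicate phi_l, reduced to the fields this sketch needs. */
typedef struct {
    int location;        /* program point l (toy: call depth from entry) */
    int characteristic;  /* the reasoning item c it was derived from     */
    int attempt;         /* which generation attempt produced it         */
} predicate;

/* Agent steps, supplied as callbacks so the loop stays self-contained. */
typedef struct {
    int       (*localize)(int c, int current_location);
    predicate (*generate)(int location, int c, int attempt);
    bool      (*validate)(const predicate *phi);
} agent_ops;

/* For each reasoning item, generate, validate, and relocate a predicate.
   Writes the accepted predicates to out and returns their count. */
size_t synthesize(const agent_ops *ops, const int *cs, size_t n,
                  int max_iterations, int max_attempts, predicate *out)
{
    assert(ops != NULL && out != NULL && (cs != NULL || n == 0));
    size_t count = 0;

    for (size_t i = 0; i < n; i++) {
        predicate best = { NO_LOCATION, cs[i], 0 };
        int l = ops->localize(cs[i], NO_LOCATION);   /* initial point */

        for (int iter = 0; l != NO_LOCATION && iter <= max_iterations; iter++) {
            predicate phi = best;
            bool ok = false;
            /* Regenerate until the validator accepts, up to a bound. */
            for (int a = 0; a < max_attempts && !ok; a++) {
                phi = ops->generate(l, cs[i], a);
                ok = ops->validate(&phi);
            }
            if (!ok)
                break;                    /* keep the last validated one */
            best = phi;
            l = ops->localize(cs[i], l);  /* refine: find a better location */
        }
        if (best.location != NO_LOCATION)
            out[count++] = best;
    }
    return count;
}

/* Toy callbacks (hypothetical): start three calls deep, step toward entry. */
int toy_localize(int c, int current)
{
    (void)c;
    if (current == NO_LOCATION)
        return 3;
    return current > 1 ? current - 1 : NO_LOCATION;
}

predicate toy_generate(int location, int c, int attempt)
{
    predicate p = { location, c, attempt };
    return p;
}

bool toy_validate(const predicate *phi)
{
    return phi->attempt >= 1;   /* the first attempt always fails */
}
```

Capping the retries is a design choice of this sketch: the pseudocode's inner while loop runs until validation succeeds, which a real agent would presumably also bound in some way.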
Locus first synthesizes an initial set of predicates by analyzing
the canary and approximating a list of semantic characteristics of
the inputs that are likely related to this canary, e.g., data structures,
types, and properties. The synthesizer is then prompted to
consider these constraints from multiple dimensions. For example,
in Figure 1, besides the predicate, the synthesizer generates multiple
constraints, such as requiring that the input PNG file must
contain a valid signature. These approximations are progressively
concretized and refined as the synthesizer retrieves more relevant
code and reasons about the program execution.

For each approximated characteristic, the synthesizer needs to
identify an appropriate program point 𝑙 where the predicate can
be expressed in terms of the variables and expressions in scope.
Asking the LLM to directly identify the exact program point is
often too challenging and unreliable, so the initial stage only asks
the synthesizer to select a candidate function rather than a precise
program point. Given the reasoning context generated by the LLM
so far, the synthesizer attempts to generate an initial predicate
𝜙𝑙 in the candidate function. Once 𝜙𝑙 passes the validation stage
(Section 3.4), we refine it in the following iterations. The goal of
such a refinement is to identify an earlier program point where the
same semantic check can be performed. This allows the program
to reject invalid inputs sooner in the execution, enabling the fuzzer
to explore more valid mutations within the same time budget.

It is worth noting that with the refinement iteration, a predicate
can be refined all the way up to the fuzzing harness. In some cases,
Locus can indeed effectively synthesize predicates at the program
entry, making it similar in spirit to automated harness generation
works such as HGFuzzer [90] and InputBlaster [54]. However, in
most cases, the input constraints cannot be directly accessed at the
harness level. For example, in Section 2.2, verifying the presence
and content of a PLTE chunk requires parsing internal structures
of the input that are only accessible deeper in the program. This
necessitates placing predicates at intermediate program points.

3.4 Validation
The validation step is critical to preserving the correctness of the
instrumented program and ensuring that the inserted predicate 𝜙𝑙
maintains fuzzing admissibility. As shown in Algorithm 1 lines
7 to 9, the validator takes as input the candidate predicate 𝜙, the
vulnerability canary 𝜓, and the program 𝑃, and validates whether
the predicate is both syntactically and semantically correct. If the
predicate passes both checks, Locus will refine the predicate by
exploring potentially better program points closer to the program
entry (Section 3.3). If it fails, Locus will self-reflect and attempt to
regenerate the predicate, using diagnostic feedback collected from
the validator.

Syntax validation The first component of validation is syntactic
checking. A predicate that fails to compile cannot be used in a
fuzzing campaign, regardless of its intended semantics. To verify
this, Locus instruments the program with the predicate 𝜙 at the
designated program point 𝑙 and invokes the project’s build system using
a predefined command. If the build fails, the associated compiler
error messages are sent to the synthesizer for repair. This diagnostic
information typically involves undeclared symbols, type
mismatches, or malformed expressions.

Semantic validation The second component is semantic validation,
which confirms that the predicate 𝜙 strictly relaxes 𝜓. It
ensures that the predicate does not reject any execution paths that
could reach the vulnerability (Theorem 1). As we demonstrated
in Definition 3, such validation requires enumerating all possible
program inputs, which cannot be done within an acceptable time
budget. Therefore, we utilize symbolic execution to find counterexamples
of the relaxation.

Theorem 2. A predicate 𝜙 is not a relaxation of 𝜙′ if there exists
𝑃 ⇓𝑥 𝑠 such that 𝜙(𝑠) = False ∧ 𝜙′(𝑠) = True.

Specifically, the predicate is not a strict relaxation of the canary
if the symbolic execution can find a path that satisfies the negated
predicate ¬𝜙 while the canary 𝜓 still evaluates to true.

It is natural to employ symbolic execution as a formal checker for
Theorem 2 to ensure fuzzing admissibility. However, it often incurs
substantial overhead by exploring irrelevant execution paths [7],
e.g., branches that are unrelated to either the synthesized predicate
or the target canary. This excessive path exploration can lead to
prohibitively long validation time, adding to the overhead
of deploying Locus. To mitigate this inefficiency, we adopt a strategy
inspired by Chopper [82] that skips specified parts of the code
and targets only the exploration of paths according to our selection.
Specifically, we perform a lightweight reachability analysis
on the CFG and prune nodes that are not reachable from either the
predicate or the canary. Locus then initiates symbolic execution to
explore the paths between ¬𝜙 and 𝜓.

4 Evaluation
We aim to answer the following research questions:
RQ1 Effectiveness: How effective is Locus in accelerating the generation of PoC inputs for given target vulnerabilities?
RQ2 Cost and performance: What is the time cost and token cost for deploying Locus?
RQ3 Ablations: How do the individual components of Locus contribute to its overall performance?
RQ4 Security impact: How can Locus assist in real-world vulnerability detection scenarios?

4.1 Setup
Dataset We evaluate Locus on the Magma fuzzing benchmark [32],
which includes a diverse set of real-world vulnerabilities selected
from nine popular open-source software projects. We also consider
popular libraries in the wild, e.g., libarchive, to evaluate the capabilities
of Locus in finding real-world vulnerabilities (see Section 4.6).

Baselines We select eight fuzzers (see Table 4) as our baselines,
covering both the state-of-the-art directed and coverage-guided
ones. For each fuzzer, we use the latest stable version available at
the time of writing.

For directed fuzzers, we consider AFLGo [6], SelectFuzz [56],
Beacon [35], and Titan [36]. We do not include a comprehensive
set of directed fuzzers due to the lack of publicly available implementations
and the challenges of integrating them into Magma.
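Looking back at the semantic validation step of Section 3.4, the counterexample condition of Theorem 2 can be illustrated with a minimal sketch. The sketch below is not the paper's implementation: it replaces symbolic execution with brute-force enumeration over a small, hypothetical state space, and the state fields (`length`, `tag`), the toy canary `psi`, and the candidate predicates `phi_ok` and `phi_bad` are all assumptions made up for illustration.

```python
from itertools import product


def find_counterexample(phi, psi, states):
    """Return a state where phi rejects (False) while psi holds (True), else None.

    Such a state witnesses that phi is NOT a relaxation of psi (Theorem 2).
    Enumeration stands in for symbolic execution here (an assumption).
    """
    for s in states:
        if not phi(s) and psi(s):
            return s
    return None


# Hypothetical toy program states: a chunk tag and a chunk length.
states = [{"length": n, "tag": t}
          for n, t in product(range(8), ("PLTE", "IDAT"))]


def psi(s):
    """Toy canary: the target state needs a PLTE chunk longer than 4."""
    return s["tag"] == "PLTE" and s["length"] > 4


def phi_ok(s):
    """Candidate that only requires a PLTE chunk: weaker than the canary."""
    return s["tag"] == "PLTE"


def phi_bad(s):
    """Candidate that is too strict: it rejects some canary-satisfying states."""
    return s["tag"] == "PLTE" and s["length"] > 6


print(find_counterexample(phi_ok, psi, states))   # no counterexample found
print(find_counterexample(phi_bad, psi, states))  # a state with length 5
```

In this toy setting `phi_ok` would pass validation, while `phi_bad` would be sent back to the synthesizer together with the witnessing state as diagnostic feedback.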
Table 2: TTE for each vulnerability in the Magma benchmark across different directed fuzzers. T.O. indicates that the fuzzer
cannot find the vulnerability within 24 hours. ∅ indicates that the fuzzer either could not compile the target program or the
preprocessing step exceeded 24 hours.

Vul ID | AFLGo: origin, Locus, ratio, 𝑝 | SelectFuzz: origin, Locus, ratio, 𝑝 | Beacon: origin, Locus, ratio, 𝑝 | Titan: origin, Locus, ratio, 𝑝
PNG003 4 4 1.0 1 3 3 1.0 1 5 5 1.0 1 8 8 1.0 1
PNG006 T.O. T.O. - - T.O. T.O. - - T.O. T.O. - - T.O. T.O. - -
PNG007 72536 9052 8.0 <0.01 58770 8537 6.9 0.04 7764 1659 4.7 <0.01 17561 11954 1.5 0.04
SND001 T.O. 419 206.5 <0.01 7764 5 1725.4 <0.01 432 13 33.2 <0.01 115 8 14.4 <0.01
SND005 1015 205 5.0 <0.01 102 4 23.2 <0.01 95 16 5.9 <0.01 30412 1046 29.1 <0.01
SND006 T.O. 3899 22.2 <0.01 8519 12 709.9 <0.01 T.O. T.O. - - 26171 905 28.9 <0.01
SND007 T.O. 3468 24.9 <0.01 7785 62 126.4 <0.01 T.O. T.O. - - 310 28 11.1 <0.01
SND017 4453 1021 4.4 <0.01 3235 1518 2.1 <0.01 4861 58 83.8 <0.01 40955 256 160.0 <0.01
SND020 3805 1291 2.9 <0.01 694 542 1.3 <0.01 5683 41 138.6 <0.01 70432 604 116.7 <0.01
SND024 T.O. 2577 33.5 <0.01 5957 18 332.8 0.05 T.O. T.O. - - 1074 8 127.8 <0.01
TIF002 T.O. T.O. - - T.O. T.O. - - T.O. T.O. - - T.O. T.O. - -
TIF005 T.O. T.O. - - T.O. T.O. - - T.O. T.O. - - T.O. T.O. - -
TIF006 80998 17774 4.6 <0.01 T.O. T.O. - - 35493 6715 5.3 0.02 66692 65412 1.0 0.04
TIF007 18513 2107 8.8 <0.01 1410 413 3.4 <0.01 220 76 2.9 <0.01 146 109 1.3 <0.01
TIF009 32556 16767 1.9 0.04 12067 4144 2.9 <0.01 T.O. T.O. - - T.O. T.O. - -
TIF012 47424 18689 2.5 0.02 9118 4787 1.9 <0.01 18782 12274 1.5 0.04 2922 2620 1.1 <0.01
TIF014 66000 8134 8.1 <0.01 58851 7833 7.5 <0.01 19035 9672 2.0 0.02 44384 14360 3.1 <0.01
LUA004 T.O. T.O. - - T.O. T.O. - - T.O. T.O. - - 69594 67442 1.0 <0.01
XML001 T.O. T.O. - - T.O. T.O. - - T.O. T.O. - - T.O. T.O. - -
XML003 T.O. T.O. - - T.O. T.O. - - 44056 13295 3.3 <0.01 53689 36365 1.5 0.32
XML009 T.O. 77513 1.1 0.6 T.O. 56 1540.1 <0.01 32585 6748 4.8 <0.01 17942 13879 1.3 0.03
XML017 17 8 2.1 <0.01 7 9 0.7 <0.01 10 15 0.7 0 23 13 1.8 <0.01
SSL001 T.O. T.O. - - 72078 62637 1.2 <0.01 ∅ ∅ - - ∅ ∅ - -
SSL003 222 231 1.0 0.04 73 69 1.1 0.02 ∅ ∅ - - ∅ ∅ - -
PHP004 165 269 0.6 <0.01 11 11 1.0 1 ∅ ∅ - - ∅ ∅ - -
PHP009 99 203 0.5 0.04 185 153 1.2 <0.01 ∅ ∅ - - ∅ ∅ - -
PHP011 ∅ ∅ - - 26 23 1.1 <0.01 ∅ ∅ - - ∅ ∅ - -
SQL018 49520 30855 1.6 0.03 53397 6748 7.9 <0.01 T.O. T.O. - - 78238 42275 1.9 0.04
Speedup 17.0× 214.2× 22.1× 28.0×

We also consider coverage-guided fuzzers, including the widely
used AFL [99] and AFL++ [27]. To make the evaluation more comprehensive,
we include two additional fuzzers. MOPT [57] enhances
fuzzing with optimized mutation strategies, and Fox [75] optimizes
the seed scheduling as online stochastic control.

Note that Locus is not a standalone fuzzer, but focuses exclusively
on code transformations for the target software. Therefore,
it is agnostic to the fuzzer implementations and can complement
any fuzzer (Figure 2).

Metrics We follow the existing directed fuzzing approaches by
adopting the Time To Exposure (TTE) to measure the performance
of the baseline fuzzers and the improvement introduced by integrating
Locus. TTE measures the time taken by a fuzzer to find the
input that satisfies the canary condition. To mitigate the inherent
randomness introduced by fuzzing, we follow [32] by executing 10
independent fuzzing trials per vulnerability sample and report the
average TTE. We also employ the Mann-Whitney U Test [59] to
demonstrate the statistical significance (𝑝-value) of the results.

Implementations We run all the fuzzers with the same initial
seed inputs provided in Magma to ensure a fair comparison. Each
fuzzing trial is capped at 24 hours, so those that fail to find the
triggering input are recorded at this maximum duration. This potentially
underestimates the actual TTE for baseline fuzzers, so the
recorded value is only a lower bound, and the actual improvement
brought by Locus can be even larger. To eliminate hardware and
system-related discrepancies, all experiments are conducted on a
dedicated cluster, where each server comes with an Intel Xeon Gold
6126 CPU and 128GB of RAM running Ubuntu 20.04.

We implement Locus using the PydanticAI framework [16]. The
synthesizer’s tooling is primarily built on top of Multiplier [29].
We leverage the SVF static analysis framework [80] to perform a
lightweight reachability analysis and use KLEE [7] as the symbolic
execution engine to perform semantic validation. We set the LLM
generation temperature to zero to avoid non-determinism. We use
o3-mini-2025-01-31 as the default model with the reasoning level
set to medium, but we also evaluate other LLMs in Section 4.4.

4.2 RQ1: Vulnerability Reproduction
We apply Locus on both directed fuzzers and coverage-guided
fuzzers, and compare the performance w/ and w/o Locus. The
results are shown in Table 2 and Table 3.
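To make the metric concrete, here is a minimal sketch of how a per-vulnerability average TTE and speedup ratio could be computed under the 24-hour cap described in the setup. The function names, the use of `None` for a timed-out trial, and the trial values are hypothetical and are not taken from the paper's artifacts.

```python
from statistics import mean

CAP_SECONDS = 24 * 60 * 60  # each fuzzing trial is capped at 24 hours


def mean_tte(trials):
    """Average TTE in seconds; None marks a trial that never hit the canary.

    Timed-out trials are recorded at the cap, so the result is a lower
    bound on the true average TTE whenever any trial timed out.
    """
    capped = [CAP_SECONDS if t is None else min(t, CAP_SECONDS) for t in trials]
    return mean(capped)


def speedup(baseline_trials, locus_trials):
    """Ratio of baseline average TTE to the average TTE with predicates."""
    return mean_tte(baseline_trials) / mean_tte(locus_trials)


# Hypothetical data: the baseline times out in all 10 trials, while the
# instrumented target is solved in a few hundred seconds each time.
baseline = [None] * 10
with_locus = [400, 420, 410, 430, 415, 425, 405, 435, 419, 431]

print(f"{speedup(baseline, with_locus):.1f}x")  # about 206.2x for this made-up data
```

Because a timed-out baseline is recorded at the cap, a ratio computed this way can only understate the true speedup for that vulnerability.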
Table 3: TTE for each vulnerability in the Magma benchmark across different coverage-guided fuzzers. T.O. indicates that the
fuzzer cannot find the vulnerability within 24 hours. ∅ indicates that the fuzzer either could not compile the target program or
the preprocessing step exceeded 24 hours.

Vul ID | AFL++: origin, Locus, ratio, 𝑝 | AFL: origin, Locus, ratio, 𝑝 | MOPT: origin, Locus, ratio, 𝑝 | Fox: origin, Locus, ratio, 𝑝
PNG003 12 8 1.5 0.04 3 3 1.0 1.00 4 4 1.0 1.00 5 5 1.0 1.00
PNG006 106 63 1.7 0.16 T.O. T.O. - - T.O. T.O. - - 6978 1582 4.4 <0.01
PNG007 53104 41101 1.3 0.05 55351 18358 3.0 <0.01 42811 37522 1.1 0.07 4651 4009 1.2 0.05
SND001 451 19 24.4 <0.01 75940 338 224.4 <0.01 285 12 23.6 <0.01 447 27 16.6 <0.01
SND005 1233 193 6.4 <0.01 796 9 88.4 <0.01 53 9 6.3 <0.01 82 15 5.5 <0.01
SND006 7026 24 290.3 <0.01 T.O. 2251 38.4 <0.01 328 17 19.6 <0.01 15422 717 21.5 <0.01
SND007 810 34 23.6 <0.01 T.O. 3163 27.3 <0.01 360 16 22.4 <0.01 3226 129 25.0 <0.01
SND017 841 196 4.3 <0.01 3162 833 3.8 <0.01 33 20 1.7 <0.01 2117 178 11.9 <0.01
SND020 640 204 3.1 0.08 2916 1266 2.3 <0.01 326 103 3.2 <0.01 2068 176 11.8 <0.01
SND024 361 18 19.6 <0.01 T.O. 2391 36.1 <0.01 326 14 23.4 <0.01 2642 32 82.6 <0.01
TIF002 72896 69963 1.0 0.85 T.O. T.O. - - T.O. T.O. - - T.O. T.O. - -
TIF005 420 260 1.6 0.04 T.O. T.O. - - T.O. T.O. - - 4744 990 4.8 <0.01
TIF006 970 336 2.9 <0.01 52926 25366 2.1 0.02 39810 45124 0.9 0.44 22380 7607 2.9 <0.01
TIF007 50 26 2.0 0.04 15022 1172 12.8 0.04 155 56 2.8 <0.01 8868 290 30.6 <0.01
TIF009 17483 5234 3.3 <0.01 19302 14098 1.4 0.01 26049 15622 1.7 0.05 7696 3406 2.3 <0.01
TIF012 1731 1122 1.5 0.02 28884 16318 1.8 0.09 3639 953 3.8 0.02 1257 637 2.0 <0.01
TIF014 2555 682 3.7 <0.01 66072 7303 9.0 <0.01 1660 788 2.1 <0.01 4666 3275 1.4 <0.01
LUA004 20046 11939 1.7 0.03 44954 22434 2.0 0.03 12401 2098 5.9 <0.01 35998 6937 5.2 <0.01
XML001 3720 1330 2.8 <0.01 T.O. T.O. - - 37726 16801 2.2 0.03 T.O. T.O. - -
XML003 2373 1394 1.7 0.04 T.O. T.O. - - 56501 18636 3.0 0.03 12252 6844 1.8 0.04
XML009 2668 1452 1.8 <0.01 T.O. 22753 3.8 0.06 706 670 1.1 0.65 10827 8766 1.2 0.04
XML017 52 15 3.5 <0.01 26 7 3.7 <0.01 6 16 0.4 <0.01 77 35 2.2 <0.01
SSL001 23533 1583 14.9 <0.01 81838 T.O. 0.9 - 43952 16033 2.7 <0.01 66778 17764 3.8 <0.01
SSL003 193 178 1.1 <0.01 90 63 1.4 <0.01 96 79 1.2 <0.01 225 199 1.1 0.33
PHP004 66156 34974 1.9 0.03 82 89 0.9 0.03 760 185 4.1 0.03 ∅ ∅ - -
PHP009 11890 5677 2.1 0.02 143 179 0.8 0.21 650 596 1.1 0.91 ∅ ∅ - -
PHP011 1626 934 1.7 0.06 108 76 1.4 0.06 19 13 1.5 0.04 ∅ ∅ - -
SQL018 10355 4961 2.1 0.04 71038 32073 2.2 0.04 22656 28101 0.8 0.56 15498 8159 1.9 <0.01
Speedup 15.3× 20.4× 5.5× 10.6×
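The Speedup rows appear to be arithmetic means of the per-vulnerability ratio columns, and the 41.6× average quoted in the summary appears to be the mean of the eight per-fuzzer speedups. This is our reading of the tables rather than a stated definition, and the short sketch below checks it using the AFLGo ratio column of Table 2 and the two Speedup rows. The ratios are the rounded values printed in the table, so agreement is only expected up to rounding.

```python
from statistics import mean

# Ratio column for AFLGo in Table 2; rows marked "-" are omitted.
aflgo_ratios = [1.0, 8.0, 206.5, 5.0, 22.2, 24.9, 4.4, 2.9, 33.5, 4.6,
                8.8, 1.9, 2.5, 8.1, 1.1, 2.1, 1.0, 0.6, 0.5, 1.6]

# Speedup rows of Table 2 (directed) and Table 3 (coverage-guided).
per_fuzzer_speedup = {
    "AFLGo": 17.0, "SelectFuzz": 214.2, "Beacon": 22.1, "Titan": 28.0,
    "AFL++": 15.3, "AFL": 20.4, "MOPT": 5.5, "Fox": 10.6,
}

aflgo_mean = mean(aflgo_ratios)                      # close to the reported 17.0
overall = mean(list(per_fuzzer_speedup.values()))    # close to the quoted 41.6

print(f"AFLGo mean ratio: {aflgo_mean:.2f}")
print(f"Mean over fuzzers: {overall:.2f}")
```

One consequence of averaging ratios arithmetically is that a few very large entries, such as the SelectFuzz ratios above 1000 in Table 2, dominate the per-fuzzer figure.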

Table 4: Overview of selected fuzzers in the evaluation

Fuzzer | Category | Description
AFLGo [6] | directed | Distance-based seed scheduling
SelectFuzz [56] | directed | Selective path exploration
Beacon [35] | directed | Fuzzer with efficient path pruning
Titan [36] | directed | Target correlation inference
AFL [99] | coverage-guided | Evolutionary mutation strategies
AFL++ [27] | coverage-guided | Community-enhanced AFL
MOPT [57] | coverage-guided | Fuzzer with Swarm Optimization
Fox [75] | coverage-guided | Online stochastic control

In summary, Locus consistently helps all kinds of fuzzers for
vulnerability reproduction. When integrating Locus into directed
fuzzers, AFLGo [6], SelectFuzz [56], Beacon [35], and Titan [36]
achieve 17.0×, 214.2×, 22.1×, and 28.0× faster TTE on average, with
five more vulnerabilities found for AFLGo and one more vulnerability
found for SelectFuzz than those without Locus within the
fixed 24-hour time window.

For coverage-guided fuzzers, integrating Locus also yields substantial
gains: AFL++ [27], AFL [99], MOPT [57], and Fox [75]
achieve 15.3×, 20.4×, 5.5×, and 10.6× faster TTE on average, with
four more vulnerabilities found for AFL than those without Locus
within the 24-hour time window.

4.3 RQ2: Cost Analysis
We measure the time cost introduced by Locus and the token cost
incurred by LLM inference.

Time cost In Section 4.2, we focused on measuring TTE, i.e., the
time required to trigger the target canary during fuzzing. However,
only focusing on the fuzzing phase can be misleading when evaluating
the overall effectiveness of a directed fuzzer. Many directed
fuzzers perform expensive static analysis on the target program
before starting fuzzing [6, 36, 37]. Likewise, Locus requires additional
preprocessing steps, including codebase indexing, symbolic
validation of predicates, and agentic predicate synthesis. While
such preprocessing costs are often arguably a one-time effort that
can be amortized during fuzzing, in practice, we observe that some
Table 5: Average deploy cost for Locus and directed fuzzers.
All times are calculated in seconds. T.O. indicates that the
fuzzer failed to instrument the target library within 24 hours.

Target PNG SND TIF LUA XML SSL PHP SQL
Size (LoC) 95k 83k 95k 21k 320k 630k 1.6M 387k
Index 11 34 82 9 76 146 244 137
Synthesis 373 331 212 178 215 384 412 349
Validation 261 133 231 280 475 824 353 407
Total 645 498 525 467 766 1354 1009 893
#Tokens (k) 309 303 256 176 653 598 894 467
AFLGo 122 673 2689 85 5608 24799 T.O. 15630
SelectFuzz 84 199 1167 44 807 2597 4554 383
Beacon 64 113 171 35 1656 T.O. T.O. 3721
Titan 96 186 967 49 2936 T.O. T.O. 4965

Table 6: TTEs by ablating different designs and models.

Setting PNG007 SND001 TIF012 TIF014 SQL018
Ablate different designs
AFL++
origin 53104 451 1731 2555 10355
+base 54830 389 1904 1226 11573
+refine T.O. 23 4417 33931 54268
+valid 41101 19 1122 682 4961
SelectFuzz
origin 72078 7764 9118 58851 53397
+base 79348 2640 8938 12630 48059
+refine T.O. 6 10805 T.O. T.O.
+valid 8537 5 4787 7833 6748
Ablate different LLMs
AFL++
origin 53104 451 1731 2555 10355
w/ o3-mini 41101 19 1122 682 4961
w/ DR1 45104 53 992 1439 5433
w/ Gemini 38215 130 1517 2454 5274
SelectFuzz
origin 72078 7764 9118 58851 53397
w/ o3-mini 8537 5 4787 7833 6748
w/ R1 23699 6 3562 3518 7023
w/ Gemini 7020 12 7206 40332 3921

approaches incur excessive analysis time, sometimes even longer
than the time required to discover the bug by the fuzzer.

To provide a more comprehensive evaluation, we report the detailed
preprocessing time overhead incurred by Locus and baseline
fuzzers. As shown in Table 5, the overhead of the baseline exponentially
grows with the size and complexity of the codebase. In
contrast, Locus relies only on lightweight analysis tools, e.g., code
retrieval, graph traversal, etc., so it remains efficient regardless of
the project size. Particularly, Locus outperforms the best baseline
fuzzer SelectFuzz by 4.5× when evaluated on the largest program
in the Magma benchmark (PHP).

LLM token cost We assess the feasibility of deploying Locus in
practice from a financial perspective, focusing on the monetary cost
of LLM token usage. Table 5 shows the token cost for our Locus
workflow. Locus takes 457k tokens (equivalent to $0.72 USD) to
generate predicates for one sample in the Magma benchmark on
average. The monetary token costs show the potential for Locus
to act as an affordable step in assisting fuzzing.

4.4 RQ3: Ablations
We ablate the design choice of different modules in Locus and
study how they generalize to various LLMs in Table 6. We pick five
representative vulnerabilities that fall under different vulnerability
categories in Common Weakness Enumerations (CWEs). The
evaluation compares two representative mainstream fuzzing tools,
AFL++ from the coverage-guided fuzzers and SelectFuzz from the
directed fuzzers.

Effectiveness of each design component We ablate the individual
design and measure the average TTE obtained by the resulting
predicates. We begin with the base setting, where we only keep
the initial synthesized predicate (Algorithm 1 line 7). Next, we apply
the refinement strategy, allowing the synthesized predicate
to propagate to a better program point without validating its semantic
correctness. Finally, we evaluate the full pipeline of Locus
including both the syntax and semantic validation.

Table 6 shows that the design choices in Locus consistently
improve the performance. Specifically, we observe that while predicates
generated under the base setting may occasionally accelerate
fuzzing, they are not consistently beneficial. Without validation,
Locus sometimes inserts incorrect predicates into the program, as
evidenced by the cases PNG007, TIF014, and SQL018.

Varying models Besides the default o3-mini model, we consider
Deepseek R1 [18] and Gemini 2.0 Flash [30] to study the generality
of our agentic framework to different LLM architectures. Table 6
shows that Locus generalizes to different LLM architectures, except
for one case where Gemini fails to bring significant improvement
to TIF014. We investigate this case and find that Gemini failed to
elevate the generated predicate to the caller closer to the input,
such that the additional overhead introduced by evaluating the
predicates outweighs the benefit it brings to the fuzzing.

4.5 RQ4: Detecting New Vulnerabilities
We integrate Locus into a real-world vulnerability detection workflow
and apply the pipeline to a set of well-fuzzed targets, such as
VLC [83], libarchive [50], and libming [51]. To construct meaningful
fuzzing targets, we generate new canaries through two primary
approaches. First, we leverage alerts produced by the static analysis
tool SVF [80], which identify potentially vulnerable program
points based on memory access patterns, aliasing behavior, or
use-after-free risks. Second, we derive reachability canaries for program
points associated with previously patched bugs. The rationale behind
this strategy is that many real-world bugs occur in clusters or
evolve from incomplete fixes [87].
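As a cross-check on the cost analysis of Section 4.3, the figures quoted in the text can be recomputed from the rows of Table 5. The sketch below simply copies the table values into lists (the variable names are ours) and recomputes the per-target totals, the average token count, and the PHP preprocessing ratio against SelectFuzz.

```python
targets = ["PNG", "SND", "TIF", "LUA", "XML", "SSL", "PHP", "SQL"]

# Rows copied from Table 5 (times in seconds, tokens in thousands).
index_s = [11, 34, 82, 9, 76, 146, 244, 137]
synthesis_s = [373, 331, 212, 178, 215, 384, 412, 349]
validation_s = [261, 133, 231, 280, 475, 824, 353, 407]
total_s = [645, 498, 525, 467, 766, 1354, 1009, 893]
tokens_k = [309, 303, 256, 176, 653, 598, 894, 467]
selectfuzz_php_s = 4554  # SelectFuzz preprocessing time on PHP

# Each Total entry should be the sum of the three Locus stages.
computed_total = [i + s + v for i, s, v in zip(index_s, synthesis_s, validation_s)]
print(computed_total == total_s)

# Average token usage per target, in thousands of tokens.
avg_tokens_k = sum(tokens_k) / len(tokens_k)
print(avg_tokens_k)  # 457.0, matching the 457k quoted in the text

# Preprocessing-time ratio on the largest target (PHP).
php_ratio = selectfuzz_php_s / total_s[targets.index("PHP")]
print(f"{php_ratio:.1f}x")  # 4.5x
```

The recomputed values agree with the totals in the table, the 457k token average, and the 4.5× ratio reported for PHP.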
Table 7: New vulnerabilities detected by fuzzers with Locus. For
each newly found vulnerability, we run AFL++ and SelectFuzz
w/ and w/o Locus. We exclude the detailed software
information per vulnerability report guidelines.

ID | Vul. Type | AFL++: origin, Locus | SelectFuzz: origin, Locus
VLC-29163 Memory leak T.O. 30847 T.O. 26317
VLC-29162 OOB access T.O. 74835 T.O. T.O.
VLC-29238 Memory leak T.O. 21085 53872 24766
VLC-29239 Use-after-free 43680 3946 22983 5405
libming-365 Null deref T.O. 80241 T.O. T.O.
libarchive-hvqg Null deref 83622 34327 62748 18309
libarchive-fm54 OOB access T.O. 16397 46577 23280

Fixing Patch
    --- a/src/expr.c
    +++ b/src/expr.c
    @@ -73,7 +73,7 @@ Expr *AddCollateToken(
       if( pCollName->n>0 ){
    -    Expr *pNew = sqlite3ExprAlloc(pParse, TK_COLLATE, pCollName, 1);
    +    Expr *pNew = sqlite3ExprAlloc(pParse, TK_COLLATE, pCollName, dequote);
         if( pNew ){

Canary by Locus
    if( pCollName->n > 1 &&
        (pCollName->z[0]=='"' ||
         pCollName->z[0]=='`' ||
         pCollName->z[0]=='[') &&
        !dequote ){ ... }

Ground Truth
    if(!dequote ){...}

Figure 3: Locus can sometimes synthesize canary conditions
even more precisely than those in Magma, even when provided
only with the security patch. Prior to the fix, a dequoting
flag was erroneously set to 1, enabling attackers to access
uninitialized memory by forcing dequoting.

We launch 42 fuzzing samples in total, with each fuzzing campaign
lasting for 24 hours. We found seven previously undiscovered
bugs, including memory leaks, use-after-free, and out-of-bounds
access (Table 7). At the time of writing, we responsibly reported all
the new vulnerabilities to the relevant maintainers, and one of the
maintainers has already drafted a fix patch. To further evaluate
the efficiency improvement, we also repeat the fuzzing campaign
by fuzzers without Locus. The results show that without Locus,
AFL++ alone can only detect 2/7 vulnerabilities, and SelectFuzz
alone can only detect 3/7 vulnerabilities. Section 4.6 elaborates on
one newly detected vulnerability.

4.6 Case Study

Timeout cases Section 4.2 demonstrates that the synthesized predicates
can occasionally yield negligible improvement. For example,
the predicates associated with vulnerability TIF009 significantly
improve the performance of all evaluated fuzzers except for Beacon
and Titan. On average, these predicates bring 2× speedup to all
the other fuzzers, while Beacon and Titan always time out with or
without Locus synthesized predicates.

Upon further investigation, we find that their employed termination
mechanism can sometimes conflict with the canaries. Specifically,
Beacon and Titan perform static analysis to identify functions
potentially reachable from the vulnerability canary, and primarily
prioritize program inputs that can reach the statically identified
functions. The predicates synthesized by Locus thus offer only
limited guidance when they are put in the functions that are not
statically identified as close to the canary.

Canary generation by Locus Section 4.5 discusses our attempts
to generate canaries for arbitrary types of vulnerability definitions
without relying on existing ones. This shows that Locus does not
rely on a pre-defined canary to be applicable. Instead, it can be
extended to automatically generate canaries based on diverse representations
of the target states, such as the patch and bug descriptions.

To apply Locus in such settings, we also consider letting Locus
generate canary conditions based on available security patches.
Specifically, we iterate over all security patches in Magma and
prompt Locus to automate the generation of a canary condition
(where Magma relied on manual analysis), such that the unpatched
version satisfies the generated canary, while the patched version
violates it. By manually checking all the canaries generated by the
LLMs, we confirm that 27 out of 28 security patches can be correctly
translated into the same condition.

We find the remaining case to be intriguing: the LLM produces
a more precise vulnerability canary than the original one in the
Magma benchmark, as shown in Figure 3. Specifically, the vulnerability
is found in SQLite3 and classified as CWE-908, i.e., use of
uninitialized resource. When building a collate token, a constant
value 1 is incorrectly passed in instead of the dequote flag.
This allows attackers to access uninitialized memory by forcing
dequoting, which would lead to memory corruption. The original
vulnerability canary only checks whether dequoting is disabled,
and is thus invalid for unquoted names, since they would not be
dequoted even if the dequote flag were set.

Instead, the canary generated by Locus focuses specifically on
cases in which a quoted string is mishandled. Such a canary characterizes
the real vulnerability more precisely and can potentially
reduce the false positives introduced by the imprecise ground truth
canaries.

Real-world vulnerability We elaborate on the vulnerability
libarchive-fm54 in Table 7, an out-of-bounds error that can occur
when the program processes a specially crafted RAR file. The vulnerability
stems from insufficient bound checking in the delta filter
logic used during decompression. In the RAR format, filters are
lightweight data transformation routines applied to improve compression
efficiency. One common example is the delta filter, which
is often used on binary data, such as audio or image streams, to convert
absolute values into relative differences, thus making patterns
more compressible. During decompression, the filter operates on
two memory regions: the source buffer src, which holds the raw
decompressed data, and the destination buffer dst, which stores
the filtered output. This vulnerability will be triggered when src
inadvertently points into the dst buffer. This overlap can lead to
undefined behavior, including memory corruption, out-of-bounds
access, or program crashes, as the filter may read from or write to
memory locations outside the valid range.

Figure 4 shows the predicates synthesized by Locus that check
the input is a valid RAR format file and that the delta filter processing
logic will be triggered during decompression. It is seemingly
more helpful to generate the predicate to compare the address of
src and dst. However, these pointers are only assigned and computed
within the vulnerable function itself. By the time this function
is reached, it can be too late for such a predicate to effectively guide
the fuzzer.

Check the RAR format
    int code = format_code & FORMAT_BASE_MASK;
    if (code == FORMAT_RAR_V5 || code == FORMAT_RAR){

Check the delta filter
    uint64_t fp = filter->prog->fingerprint;
    if (fp == 0x1D0E06077D) { ...

Fixing Patch
    diff --git a/read_format_rar.c b/read_format_rar.c
    index 2dd0ea34..7f0ad199 100644
    @@ -3708,6 +3708,8 @@ filter_delta(filter *f, ...
       uint8_t lastbyte = 0;
       for (idx = i; idx < length; idx += channels) {
    +    if (src >= dst)
    +      return 0;
         lastbyte = dst[idx] = lastbyte - *src++;
       }

Figure 4: A previously unknown vulnerability in libarchive

5 Related Work

Directed grey-box fuzzing Directed grey-box fuzzing is particularly
challenging, primarily due to the prohibitively large search space
with sparse rewards [46, 56], i.e., unlike undirected fuzzing where
any new coverage indicates progress. Most prior works develop
heuristics through static analysis by computing path distances to
target states to narrow down the search space [6, 12, 36, 43, 46, 56,
74, 75, 79]. For example, AFLGo [6] introduces the idea of using
distance metrics between test inputs and target basic blocks to
prioritize seeds that are closer to the target, while CAFL [43] further
improves this metric by introducing specialized progress-capturing
state representation and calculating the distance from the testing
inputs to the nearest state.

Similar to Locus, some recent approaches [2, 61, 68] also explore
rewriting the program to direct execution towards hard-to-reach
states. These approaches focus on checking sophisticated conditions
uncovered by human experts, e.g., temporal properties in network
protocols, and may not generalize to broader types of target states
and can also incur expensive manual effort. In contrast, Locus offers
a generalizable framework for diverse types of programs and target
states with minimal human intervention.

LLM-based fuzzing LLMs have demonstrated promising code
reasoning capabilities [22, 23, 28, 31, 84, 93]. There has been growing
interest in applying them to support fuzzing [17]. Most existing
approaches utilize LLMs to directly generate test inputs for the
target program [19, 20, 34, 89, 91, 94, 103, 104] or build grammar-aware
input generators (i.e., fuzzing harness) to constrain the search
space [37, 52, 53, 77, 101]. While these methods demonstrate the potential
of LLMs to accelerate fuzzing, they often require LLMs to reason
about the target program states all the way to the inputs, making
them susceptible to fundamental challenges in program analysis,
such as path explosions and alias dependencies [21, 39, 48, 70],
while also posing challenges to LLMs in terms of the bloated context
length and unbounded errors.

As opposed to solely generating inputs or fuzzing harnesses, Locus
advances LLM-based fuzzing by constraining its generation at
the level of progress-capturing predicates. Such a task formulation
enables the LLM to operate within the (relatively) local context,
allowing Locus to detect certain code behaviors that only emerge
midway through execution to inform subsequent input searches,
and also facilitates the rigorous validation of potential LLM errors.
As a result, this strategy can effectively sidestep reasoning across
lengthy function call chains, enabling a greater focus on more local,
intermediate function contexts.

LLM-driven proof synthesis Locus shares a similar spirit with
recent research that focuses on reasoning about loop invariants and
program specifications to synthesize proofs and automate program
verification [9, 13, 42, 55, 92]. Locus resembles these approaches in
that it also integrates LLMs with rule-based verifiers,
such as theorem provers or SMT solvers, to check the validity of
synthesized properties. However, Locus relaxes the requirement
that the synthesized specifications must facilitate rigorous proof
steps, and instead treats the predicates as best-effort guidance for input
search. It bridges the gap between formal verification and dynamic
testing, leveraging the reasoning capabilities of LLMs to identify
meaningful program states while prioritizing efficiency and generality
over formal guarantees.

6 Threats to Validity

Threats to predicate validation While Locus leverages symbolic
execution to bound the error of the synthesizer, it inherits all of its
limitations. For example, we still face the path explosion problem,
especially when there are loops between the generated predicates
and the canary. While we adopt the common strategies to address
these limitations, e.g., loop unrolling, we cannot formally guarantee
that the relaxation brought by the generated predicates is always
valid and can thus effectively guide fuzzing. This leads to orthogonal
but interesting future directions on loop invariant reasoning and
function summarization (using LLMs) for symbolic execution.

Threats to LLM types Our evaluation is currently limited
to a specific set of LLMs and benchmark programs. The generality
of our results to other models, especially open-weight LLMs and
broader software systems, remains to be validated. The performance
of our approach may depend on the underlying LLM capability
and training data. Additionally, our evaluated benchmark Magma,
though built on popular software and real-world vulnerabilities,
may not capture the comprehensive challenges present in diverse
software.

Threats to data leakage While the task formulation, i.e., predicate
synthesis for directed fuzzing, is arguably unlikely to suffer from the
data leakage problem, the software projects in our evaluation are
largely included in the training data of major LLMs. As the
ultimate goal is to detect and confirm vulnerabilities, we believe that
11
such a threat is minor compared to the practical security impact, as we have shown in Section 4.6.

7 Conclusion

This paper presented Locus, a novel framework for enhancing directed fuzzing through synthesized progress-capturing predicates. With an agentic synthesizer-validator architecture, Locus effectively guides fuzzing via the predicates while also ensuring any errors in the synthesized predicates are fixed by symbolic execution. Our evaluation demonstrated that Locus substantially improves state-of-the-art fuzzers. So far, it has uncovered seven previously unpatched vulnerabilities across three software projects, with one already acknowledged with a draft patch.

References
[1] Cornelius Aschermann, Sergej Schumilo, Ali Abbasi, and Thorsten Holz. 2020. Ijon: Exploring Deep State Spaces via Fuzzing. In 2020 IEEE Symposium on Security and Privacy (SP). 1597–1612. doi:10.1109/SP40000.2020.00117 ISSN: 2375-1207.
[2] Jinsheng Ba, Marcel Böhme, Zahra Mirzamomen, and Abhik Roychoudhury. 2022. Stateful greybox fuzzing. In 31st USENIX Security Symposium (USENIX Security 22). USENIX Association, Boston, MA, 3255–3272. https://www.usenix.org/conference/usenixsecurity22/presentation/ba
[3] Davide Balzarotti. 2021. The use of likely invariants as feedback for fuzzers. In 30th USENIX Security Symposium (USENIX Security 21).
[4] Marcel Böhme, Bruno C d S Oliveira, and Abhik Roychoudhury. 2013. Regression tests to expose change interaction errors. In Proceedings of the 2013 9th Joint Meeting on Foundations of Software Engineering.
[5] David Brumley, Pongsin Poosankam, Dawn Song, and Jiang Zheng. 2008. Automatic patch-based exploit generation is possible: Techniques and implications. In 2008 IEEE Symposium on Security and Privacy (SP 2008).
[6] Marcel Böhme, Van-Thuan Pham, Manh-Dung Nguyen, and Abhik Roychoudhury. 2017. Directed Greybox Fuzzing. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security. ACM, Dallas Texas USA, 2329–2344. doi:10.1145/3133956.3134020
[7] Cristian Cadar, Daniel Dunbar, and Dawson R. Engler. 2008. KLEE: unassisted and automatic generation of high-coverage tests for complex systems programs. In 8th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2008, December 8-10, 2008, San Diego, California, USA, Proceedings, Richard Draves and Robbert van Renesse (Eds.). USENIX Association, 209–224. http://www.usenix.org/events/osdi08/tech/full_papers/cadar/cadar.pdf
[8] Sicong Cao, Biao He, Xiaobing Sun, Yu Ouyang, Chao Zhang, Xiaoxue Wu, Ting Su, Lili Bo, Bin Li, Chuanlei Ma, et al. 2023. Oddfuzz: Discovering java deserialization vulnerabilities via structure-aware directed greybox fuzzing. In 2023 IEEE Symposium on Security and Privacy (SP). IEEE.
[9] Saikat Chakraborty, Shuvendu K Lahiri, Sarah Fakhoury, Madanlal Musuvathi, Akash Lal, Aseem Rastogi, Aditya Senthilnathan, Rahul Sharma, and Nikhil Swamy. 2023. Ranking llm-generated loop invariants for program verification. arXiv preprint arXiv:2310.09342 (2023).
[10] Chuyang Chen, Brendan Dolan-Gavitt, and Zhiqiang Lin. 2025. ELFuzz: Efficient Input Generation via LLM-driven Synthesis Over Fuzzer Space. arXiv preprint arXiv:2506.10323 (2025).
[11] Hongxu Chen, Yinxing Xue, Yuekang Li, Bihuan Chen, Xiaofei Xie, Xiuheng Wu, and Yang Liu. 2018. Hawkeye: Towards a Desired Directed Grey-box Fuzzer. In Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security. ACM, Toronto Canada, 2095–2108. doi:10.1145/3243734.3243849
[12] Peng Chen and Hao Chen. 2018. Angora: Efficient Fuzzing by Principled Search. In 2018 IEEE Symposium on Security and Privacy (SP). 711–725. doi:10.1109/SP.2018.00046 ISSN: 2375-1207.
[13] Tianyu Chen, Shuai Lu, Shan Lu, Yeyun Gong, Chenyuan Yang, Xuheng Li, Md Rakib Hossain Misu, Hao Yu, Nan Duan, Peng Cheng, et al. 2024. Automated proof generation for rust code via self-evolution. arXiv preprint arXiv:2410.15756 (2024).
[14] Yuanliang Chen, Yu Jiang, Fuchen Ma, Jie Liang, Mingzhe Wang, Chijin Zhou, Xun Jiao, and Zhuo Su. 2019. EnFuzz: Ensemble fuzzing with seed synchronization among diverse fuzzers. In 28th USENIX Security Symposium (USENIX Security 19).
[15] Maria Christakis, Peter Müller, and Valentin Wüstholz. 2016. Guiding dynamic symbolic execution toward unverified program executions. In Proceedings of the 38th International Conference on Software Engineering.
[16] Samuel Colvin. 2025. PydanticAI. https://ai.pydantic.dev/ Version 0.4.3.
[17] DARPA. 2024. DARPA AI Cyber Challenge. https://aicyberchallenge.com/
[18] DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, and Zhihong Shao. 2025. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. doi:10.48550/arXiv.2501.12948 arXiv:2501.12948 [cs].
[19] Yinlin Deng, Chunqiu Steven Xia, Haoran Peng, Chenyuan Yang, and Lingming Zhang. 2023. Large Language Models Are Zero-Shot Fuzzers: Fuzzing Deep-Learning Libraries via Large Language Models. In Proceedings of the 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis. ACM, Seattle WA USA, 423–435. doi:10.1145/3597926.3598067
[20] Yinlin Deng, Chunqiu Steven Xia, Chenyuan Yang, Shizhuo Dylan Zhang, Shujing Yang, and Lingming Zhang. 2024. Large language models are edge-case generators: Crafting unusual programs for fuzzing deep learning libraries. In Proceedings of the 46th IEEE/ACM international conference on software engineering.
[21] Yangruibo Ding, Yanjun Fu, Omniyyah Ibrahim, Chawin Sitawarin, Xinyun Chen, Basel Alomair, David Wagner, Baishakhi Ray, and Yizheng Chen. 2024. Vulnerability detection with code language models: How far are we? arXiv preprint arXiv:2403.18624 (2024).
[22] Yangruibo Ding, Jinjun Peng, Marcus Min, Gail Kaiser, Junfeng Yang, and Baishakhi Ray. 2024. Semcoder: Training code language models with comprehensive semantics reasoning. Advances in Neural Information Processing Systems 37 (2024), 60275–60308.
[23] Yangruibo Ding, Benjamin Steenhoek, Kexin Pei, Gail Kaiser, Wei Le, and Baishakhi Ray. 2024. TRACED: Execution-aware Pre-training for Source Code. In Proceedings of the IEEE/ACM 46th International Conference on Software Engineering. ACM, Lisbon Portugal, 1–12. doi:10.1145/3597503.3608140
[24] Zhengjie Du, Yuekang Li, Yang Liu, and Bing Mao. 2022. WindRanger: a directed greybox fuzzer driven by deviation basic blocks. In Proceedings of the 44th International Conference on Software Engineering. ACM, Pittsburgh Pennsylvania, 2440–2451. doi:10.1145/3510003.3510197
[25] Rafael Dutra, Rahul Gopinath, and Andreas Zeller. 2023. Formatfuzzer: Effective fuzzing of binary file formats. ACM Transactions on Software Engineering and Methodology 33, 2 (2023), 1–29.
[26] Andrea Fioraldi, Daniele Cono D’Elia, and Davide Balzarotti. 2021. The Use of Likely Invariants as Feedback for Fuzzers. In 30th USENIX Security Symposium (USENIX Security 21). USENIX Association, 2829–2846. https://www.usenix.org/conference/usenixsecurity21/presentation/fioraldi
[27] Andrea Fioraldi, Dominik Maier, Heiko Eißfeldt, and Marc Heuse. 2020. AFL++: Combining incremental steps of fuzzing research. In 14th USENIX Workshop on Offensive Technologies (WOOT 20). USENIX Association. https://www.usenix.org/conference/woot20/presentation/fioraldi
[28] Linyuan Gong, Sida Wang, Mostafa Elhoushi, and Alvin Cheung. 2024. Evaluation of llms on syntax-aware code fill-in-the-middle tasks. arXiv preprint arXiv:2403.04814 (2024).
[29] Peter Goodman. 2025. Multiplier. https://github.com/trailofbits/multiplier Version 1705339.
[30] Google DeepMind. 2024. Gemini 2.0 Flash. https://deepmind.google/technologies/gemini/flash/. Accessed: 2025-03-29.
[31] Alex Gu, Baptiste Rozière, Hugh Leather, Armando Solar-Lezama, Gabriel Synnaeve, and Sida I Wang. 2024. CRUXEval: A Benchmark for Code Reasoning, Understanding and Execution. In Proceedings of the 41st International Conference on Machine Learning. 16568–16621.
[32] Ahmad Hazimeh, Adrian Herrera, and Mathias Payer. 2020. Magma: A Ground-Truth Fuzzing Benchmark. In Proceedings of the ACM on Measurement and Analysis of Computing Systems, Vol. 4. 1–29. doi:10.1145/3428334
[33] Christian Holler, Kim Herzig, and Andreas Zeller. 2012. Fuzzing with code fragments. In 21st USENIX Security Symposium (USENIX Security 12).
[34] Jie Hu, Qian Zhang, and Heng Yin. 2023. Augmenting greybox fuzzing with generative ai. arXiv preprint arXiv:2306.06782 (2023).
[35] Heqing Huang, Yiyuan Guo, Qingkai Shi, Peisen Yao, Rongxin Wu, and Charles Zhang. 2022. BEACON: Directed Grey-Box Fuzzing with Provable Path Pruning. In 2022 IEEE Symposium on Security and Privacy (SP). IEEE, San Francisco, CA, USA, 36–50. doi:10.1109/SP46214.2022.9833751
[36] Heqing Huang, Peisen Yao, Hung-Chun Chiu, Yiyuan Guo, and Charles Zhang. 2024. Titan: Efficient Multi-target Directed Greybox Fuzzing. In 2024 IEEE Symposium on Security and Privacy (SP). IEEE, San Francisco, CA, USA, 1849–1864. doi:10.1109/SP54263.2024.00059
[37] Heqing Huang, Anshunkang Zhou, Mathias Payer, and Charles Zhang. 2024. Everything is Good for Something: Counterexample-Guided Directed Fuzzing via Likely Invariant Inference. In 2024 IEEE Symposium on Security and Privacy (SP). 1956–1973. doi:10.1109/SP54263.2024.00142 ISSN: 2375-1207.
[38] Kyriakos Ispoglou, Daniel Austin, Vishwath Mohan, and Mathias Payer. 2020. FuzzGen: Automatic fuzzer generation. In 29th USENIX Security Symposium (USENIX Security 20).
[39] Zongze Jiang, Ming Wen, Jialun Cao, Xuanhua Shi, and Hai Jin. 2024. Towards Understanding the Effectiveness of Large Language Models on Directed Test Input Generation. In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering.
[40] Tae Eun Kim, Jaeseung Choi, Kihong Heo, and Sang Kil Cha. 2023. DAFL: Directed grey-box fuzzing guided by data dependency. In 32nd USENIX Security Symposium (USENIX Security 23). USENIX Association, Anaheim, CA, 4931–4948. https://www.usenix.org/conference/usenixsecurity23/presentation/kim-tae-eun
[41] George Klees, Andrew Ruef, Benji Cooper, Shiyi Wei, and Michael Hicks. 2018. Evaluating Fuzz Testing. In Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security. ACM, Toronto Canada, 2123–2138. doi:10.1145/3243734.3243804
[42] Andrei Kozyrev, Gleb Solovev, Nikita Khramov, and Anton Podkopaev. 2024. CoqPilot, a plugin for LLM-based generation of proofs. In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering.
[43] Gwangmu Lee, Woochul Shim, and Byoungyoung Lee. 2021. Constraint-guided Directed Greybox Fuzzing. In 30th USENIX Security Symposium (USENIX Security 21). USENIX Association, 3559–3576. https://www.usenix.org/conference/usenixsecurity21/presentation/lee-gwangmu
[44] Haonan Li, Yu Hao, Yizhuo Zhai, and Zhiyun Qian. 2024. Enhancing static analysis for practical bug detection: An llm-integrated approach. Proceedings of the ACM on Programming Languages 8, OOPSLA1 (2024), 474–499.
[45] Haonan Li, Hang Zhang, Kexin Pei, and Zhiyun Qian. 2025. The Hitchhiker’s Guide to Program Analysis, Part II: Deep Thoughts by LLMs. arXiv preprint arXiv:2504.11711 (2025).
[46] Penghui Li, Wei Meng, and Chao Zhang. 2024. SDFuzz: Target States Driven Directed Fuzzing. In 33rd USENIX Security Symposium (USENIX Security 24). USENIX Association, Philadelphia, PA, 2441–2457. https://www.usenix.org/conference/usenixsecurity24/presentation/li-penghui
[47] Yuekang Li, Bihuan Chen, Mahinthan Chandramohan, Shang-Wei Lin, Yang Liu, and Alwen Tiu. 2017. Steelix: program-state based binary fuzzing. In Proceedings of the 2017 11th joint meeting on foundations of software engineering.
[48] Ziyang Li, Saikat Dutta, and Mayur Naik. 2024. Llm-assisted static analysis for detecting security vulnerabilities. arXiv preprint arXiv:2405.17238 (2024).
[49] Hongliang Liang, Lin Jiang, Lu Ai, and Jinyi Wei. 2020. Sequence directed hybrid fuzzing. In 2020 IEEE 27th International Conference on Software Analysis, Evolution and Reengineering (SANER). IEEE.
[50] libarchive contributors. 2025. libarchive: Multi-format archive and compression library. https://www.libarchive.org/ Computer software.
[51] libming contributors. 2025. libming: SWF (Flash) file creation library. https://www.libming.org/ Computer software.
[52] Dongge Liu, Oliver Chang, Jonathan Metzman, Martin Sablotny, and Mihai Maruseac. 2024. OSS-fuzz-gen: Automated fuzz target generation. https://github.com/google/oss-fuzz-gen
[53] Jiawei Liu, Songrun Xie, Junhao Wang, Yuxiang Wei, Yifeng Ding, and Lingming Zhang. 2024. Evaluating language models for efficient code generation. arXiv preprint arXiv:2408.06450 (2024).
[54] Zhe Liu, Chunyang Chen, Junjie Wang, Mengzhuo Chen, Boyu Wu, Zhilin Tian, Yuekai Huang, Jun Hu, and Qing Wang. 2024. Testing the Limits: Unusual Text Inputs Generation for Mobile App Crash Detection with Large Language Model. In Proceedings of the IEEE/ACM 46th International Conference on Software Engineering. ACM, Lisbon Portugal, 1–12. doi:10.1145/3597503.3639118
[55] Minghai Lu, Benjamin Delaware, and Tianyi Zhang. 2024. Proof automation with large language models. In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering.
[56] Changhua Luo, Wei Meng, and Penghui Li. 2023. SelectFuzz: Efficient Directed Fuzzing with Selective Path Exploration. In 2023 IEEE Symposium on Security and Privacy (SP). IEEE, San Francisco, CA, USA, 2693–2707. doi:10.1109/SP46215.2023.10179296
[57] Chenyang Lyu, Shouling Ji, Chao Zhang, Yuwei Li, Wei-Han Lee, Yu Song, and Raheem Beyah. 2019. MOPT: Optimized Mutation Scheduling for Fuzzers. In 28th USENIX Security Symposium (USENIX Security 19). USENIX Association, Santa Clara, CA, 1949–1966. https://www.usenix.org/conference/usenixsecurity19/presentation/lyu
[58] Yunlong Lyu, Yuxuan Xie, Peng Chen, and Hao Chen. 2024. Prompt Fuzzing for Fuzz Driver Generation. In Proceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security. ACM, Salt Lake City UT USA, 3793–3807. doi:10.1145/3658644.3670396
[59] H. B. Mann and D. R. Whitney. 1947. On a Test of Whether one of Two Random Variables is Stochastically Larger than the Other. Annals of Mathematical Statistics 18, 1 (March 1947), 50–60. doi:10.1214/aoms/1177730491
[60] Paul Dan Marinescu and Cristian Cadar. 2013. KATCH: High-coverage testing of software patches. In Proceedings of the 2013 9th Joint Meeting on Foundations of Software Engineering.
[61] Ruijie Meng, Zhen Dong, Jialin Li, Ivan Beschastnikh, and Abhik Roychoudhury. 2022. Linear-time temporal logic guided greybox fuzzing. In Proceedings of the 44th International Conference on Software Engineering. ACM, Pittsburgh Pennsylvania, 1343–1355. doi:10.1145/3510003.3510082
[62] Charalambos Mitropoulos, Thodoris Sotiropoulos, Sotiris Ioannidis, and Dimitris Mitropoulos. 2023. Syntax-aware mutation for testing the solidity compiler. In European Symposium on Research in Computer Security. Springer.
[63] Aniruddhan Murali, Noble Mathews, Mahmoud Alfadel, Meiyappan Nagappan, and Meng Xu. 2024. Fuzzslice: Pruning false positives in static analysis warnings through function-level fuzzing. In Proceedings of the 46th IEEE/ACM International Conference on Software Engineering.
[64] Manh-Dung Nguyen, Sébastien Bardin, Richard Bonichon, Roland Groz, and Matthieu Lemerre. 2020. Binary-level directed fuzzing for use-after-free vulnerabilities. In 23rd International Symposium on Research in Attacks, Intrusions and Defenses (RAID 2020).
[65] Rohan Padhye, Caroline Lemieux, Koushik Sen, Laurent Simon, and Hayawardh Vijayakumar. 2019. Fuzzfactory: domain-specific fuzzing with waypoints. Proceedings of the ACM on Programming Languages 3, OOPSLA (2019), 1–29.
[66] Jibesh Patra and Michael Pradel. 2016. Learning to fuzz: Application-independent fuzz testing with probabilistic, generative models of input data. TU Darmstadt, Department of Computer Science, Tech. Rep. TUD-CS-2016-14664 (2016).
[67] Paul Gauthier. 2024. Aider, AI pair programming in your terminal. https://aider.chat.
[68] Hui Peng, Yan Shoshitaishvili, and Mathias Payer. 2018. T-Fuzz: Fuzzing by Program Transformation. In 2018 IEEE Symposium on Security and Privacy (SP). IEEE, San Francisco, CA, 697–710. doi:10.1109/SP.2018.00056
[69] Manuel Rigger and Zhendong Su. 2020. Finding bugs in database systems via query partitioning. Proceedings of the ACM on Programming Languages (2020).
[70] Niklas Risse and Marcel Böhme. 2024. Uncovering the limits of machine learning for automatic vulnerability detection. In 33rd USENIX Security Symposium (USENIX Security 24).
[71] Niklas Risse, Jing Liu, and Marcel Böhme. 2025. Top Score on the Wrong Exam: On Benchmarking in Machine Learning for Vulnerability Detection. In Proceedings of the ACM on Software Engineering, Vol. 2. 388–410. doi:10.1145/3728887
[72] Pranab Sahoo, Prabhash Meharia, Akash Ghosh, Sriparna Saha, Vinija Jain, and Aman Chadha. 2024. A comprehensive survey of hallucination in large language, image, video and audio foundation models. arXiv preprint arXiv:2405.09589 (2024).
[73] Konstantin Serebryany, Derek Bruening, Alexander Potapenko, and Dmitriy Vyukov. 2012. AddressSanitizer: A fast address sanity checker. In 2012 USENIX annual technical conference (USENIX ATC 12). 309–318.
[74] Abhishek Shah, Dongdong She, Samanway Sadhu, Krish Singal, Peter Coffman, and Suman Jana. 2022. MC2: Rigorous and Efficient Directed Greybox Fuzzing. In Proceedings of the 2022 ACM SIGSAC Conference on Computer and Communications Security. ACM, Los Angeles CA USA, 2595–2609. doi:10.1145/3548606.3560648
[75] Dongdong She, Adam Storek, Yuchong Xie, Seoyoung Kweon, Prashast Srivastava, and Suman Jana. 2024. FOX: Coverage-guided Fuzzing as Online Stochastic Control. In Proceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security. ACM, Salt Lake City UT USA, 765–779. doi:10.1145/3658644.3670362
[76] Gabriel Sherman and Stefan Nagy. 2025. No harness, no problem: Oracle-guided harnessing for auto-generating C API fuzzing harnesses. In IEEE/ACM International Conference on Software Engineering (ICSE).
[77] Wenxuan Shi, Yunhang Zhang, Xinyu Xing, and Jun Xu. 2024. Harnessing Large Language Models for Seed Generation in Greybox Fuzzing. arXiv preprint arXiv:2411.18143 (2024).
[78] Prashast Srivastava, Stefan Nagy, Matthew Hicks, Antonio Bianchi, and Mathias Payer. 2022. One Fuzz Doesn’t Fit All: Optimizing Directed Fuzzing via Target-tailored Program State Restriction. In Proceedings of the 38th Annual Computer Security Applications Conference. ACM, Austin TX USA, 388–399. doi:10.1145/3564625.3564643
[79] Prashast Srivastava, Stefan Nagy, Matthew Hicks, Antonio Bianchi, and Mathias Payer. 2022. One fuzz doesn’t fit all: Optimizing directed fuzzing via target-tailored program state restriction. In Proceedings of the 38th Annual Computer Security Applications Conference.
[80] Yulei Sui and Jingling Xue. 2016. SVF: interprocedural static value-flow analysis in LLVM. In Proceedings of the 25th International Conference on Compiler Construction. ACM, Barcelona Spain, 265–266. doi:10.1145/2892208.2892235
[81] Xin Tan, Yuan Zhang, Jiadong Lu, Xin Xiong, Zhuang Liu, and Min Yang. 2023. SyzDirect: Directed Greybox Fuzzing for Linux Kernel. In Proceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security. ACM, Copenhagen Denmark, 1630–1644. doi:10.1145/3576915.3623146
[82] David Trabish, Andrea Mattavelli, Noam Rinetzky, and Cristian Cadar. 2018. Chopped symbolic execution. In Proceedings of the 40th International Conference on Software Engineering. ACM, Gothenburg Sweden, 350–360. doi:10.1145/3180155.3180251
[83] VideoLAN. 2025. VLC media player. https://www.videolan.org/vlc/ Computer software.
[84] Chengpeng Wang, Wuqi Zhang, Zian Su, Xiangzhe Xu, Xiaoheng Xie, and Xiangyu Zhang. 2024. LLMDFA: Analyzing dataflow in code with large language models. In Advances in Neural Information Processing Systems, A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang (Eds.), Vol. 37. Curran Associates, Inc., 131545–131574. https://proceedings.neurips.cc/paper_files/paper/2024/file/ed9dcde1eb9c597f68c1d375bbecf3fc-Paper-Conference.pdf
[85] Daimeng Wang, Zheng Zhang, Hang Zhang, Zhiyun Qian, Srikanth V. Krishnamurthy, and Nael Abu-Ghazaleh. 2021. SyzVegas: Beating kernel fuzzing odds with reinforcement learning. In 30th USENIX Security Symposium (USENIX Security 21). USENIX Association, 2741–2758. https://www.usenix.org/conference/usenixsecurity21/presentation/wang-daimeng
[86] Junjie Wang, Yuhan Ma, Xiaofei Xie, Xiaoning Du, and Xiangwei Zhang. 2025. PatchFuzz: Patch Fuzzing for JavaScript Engines. arXiv preprint arXiv:2505.00289 (2025).
[87] Yanhao Wang, Xiangkun Jia, Yuwei Liu, Kyle Zeng, Tiffany Bao, Dinghao Wu, and Purui Su. 2020. Not All Coverage Measurements Are Equal: Fuzzing by Coverage Accounting for Input Prioritization. In NDSS.
[88] Felix Weissberg, Jonas Möller, Tom Ganz, Erik Imgrund, Lukas Pirch, Lukas Seidel, Moritz Schloegel, Thorsten Eisenhofer, and Konrad Rieck. 2024. SoK: Where to Fuzz? Assessing Target Selection Methods in Directed Fuzzing. In Proceedings of the 19th ACM Asia Conference on Computer and Communications Security. ACM, Singapore Singapore, 1539–1553. doi:10.1145/3634737.3661141
[89] Chunqiu Steven Xia, Matteo Paltenghi, Jia Le Tian, Michael Pradel, and Lingming Zhang. 2024. Fuzz4all: Universal fuzzing with large language models. In Proceedings of the IEEE/ACM 46th International Conference on Software Engineering.
[90] Hanxiang Xu, Yanjie Zhao, and Haoyu Wang. 2025. Directed Greybox Fuzzing via Large Language Model. doi:10.48550/arXiv.2505.03425 arXiv:2505.03425 [cs].
[91] Chenyuan Yang, Yinlin Deng, Runyu Lu, Jiayi Yao, Jiawei Liu, Reyhaneh Jabbarvand, and Lingming Zhang. 2024. WhiteFox: White-Box Compiler Fuzzing Empowered by Large Language Models. In Object-oriented Programming, Systems, Languages, and Applications, Vol. 8. 709–735. doi:10.1145/3689736
[92] Chenyuan Yang, Xuheng Li, Md Rakib Hossain Misu, Jianan Yao, Weidong Cui, Yeyun Gong, Chris Hawblitzel, Shuvendu Lahiri, Jacob R Lorch, Shuai Lu, et al. 2024. AutoVerus: Automated proof generation for Rust code. arXiv preprint arXiv:2409.13082 (2024).
[93] Chenyuan Yang, Zijie Zhao, Zichen Xie, Haoyu Li, and Lingming Zhang. 2025. KNighter: Transforming Static Analysis with LLM-Synthesized Checkers. doi:10.48550/arXiv.2503.09002 arXiv:2503.09002 [cs].
[94] Chenyuan Yang, Zijie Zhao, and Lingming Zhang. 2023. Kernelgpt: Enhanced kernel fuzzing via large language models. arXiv preprint arXiv:2401.00563 (2023).
[95] John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. 2024. SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering. In Advances in Neural Information Processing Systems, A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang (Eds.), Vol. 37. Curran Associates, Inc., 50528–50652. https://proceedings.neurips.cc/paper_files/paper/2024/file/5a7c947568c1b1328ccc5230172e1e7c-Paper-Conference.pdf
[96] Yupeng Yang, Shenglong Yao, Jizhou Chen, and Wenke Lee. 2025. Hybrid Language Processor Fuzzing via LLM-Based Constraint Solving. In 34th USENIX Security Symposium (USENIX Security 25).
[97] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. 2023. ReAct: Synergizing reasoning and acting in language models. In The Eleventh International Conference on Learning Representations. https://openreview.net/forum?id=WE_vluYUL-X
[98] Zijun Yao, Yantao Liu, Yanxu Chen, Jianhui Chen, Junfeng Fang, Lei Hou, Juanzi Li, and Tat-Seng Chua. 2025. Are Reasoning Models More Prone to Hallucination? arXiv preprint arXiv:2505.23646 (2025).
[99] Michal Zalewski. 2020. American Fuzzy Lop. https://github.com/google/AFL
[100] Cen Zhang, Yuekang Li, Hao Zhou, Xiaohan Zhang, Yaowen Zheng, Xian Zhan, Xiaofei Xie, Xiapu Luo, Xinghua Li, Yang Liu, et al. 2023. Automata-Guided Control-Flow-Sensitive Fuzz Driver Generation. In USENIX Security Symposium.
[101] Cen Zhang, Yaowen Zheng, Mingqiang Bai, Yeting Li, Wei Ma, Xiaofei Xie, Yuekang Li, Limin Sun, and Yang Liu. 2024. How effective are they? Exploring large language model based fuzz driver generation. In Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis.
[102] Cen Zhang, Yaowen Zheng, Mingqiang Bai, Yeting Li, Wei Ma, Xiaofei Xie, Yuekang Li, Limin Sun, and Yang Liu. 2024. How Effective Are They? Exploring Large Language Model Based Fuzz Driver Generation. In Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis. ACM, Vienna Austria, 1223–1235. doi:10.1145/3650212.3680355
[103] Hongxiang Zhang, Yuyang Rong, Yifeng He, and Hao Chen. 2024. Llamafuzz: Large language model enhanced greybox fuzzing. arXiv preprint arXiv:2406.07714 (2024).
[104] Qiang Zhang, Yuheng Shen, Jianzhong Liu, Yiru Xu, Heyuan Shi, Yu Jiang, and Wanli Chang. 2024. ECG: Augmenting Embedded Operating System Fuzzing via LLM-Based Corpus Generation. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 43, 11 (2024), 4238–4249.
[105] Yuntong Zhang, Haifeng Ruan, Zhiyu Fan, and Abhik Roychoudhury. 2024. AutoCodeRover: Autonomous Program Improvement. In Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis. ACM, Vienna Austria, 1592–1604. doi:10.1145/3650212.3680384
