Locus
Abstract

Directed fuzzing aims to find program inputs that lead to specified target program states. It has broad applications, such as debugging system crashes, confirming reported bugs, and generating exploits for potential vulnerabilities. This task is inherently challenging because target states are often deeply nested in the program, while the search space manifested by numerous possible program inputs is prohibitively large. Existing approaches rely on branch distances or manually specified constraints to guide the search; however, the branches alone are often insufficient to precisely characterize progress toward reaching the target states, while the manually specified constraints are often tailored for specific bug types and thus difficult to generalize to diverse target states and programs.

We present Locus, a novel framework to improve the efficiency of directed fuzzing. Our key insight is to synthesize predicates to capture fuzzing progress as semantically meaningful intermediate states, serving as milestones towards reaching the target states. When used to instrument the program under fuzzing, they can reject executions unlikely to reach the target states, while providing additional coverage guidance. To automate this task and generalize to diverse programs, Locus features an agentic framework with program analysis tools to synthesize and iteratively refine the candidate predicates, while using symbolic execution to ensure that the predicates strictly relax the target states, which prevents false rejections. Our evaluation shows that Locus substantially improves the efficiency of eight state-of-the-art fuzzers in discovering real-world vulnerabilities, achieving an average speedup of 41.6×. So far, Locus has found eight previously unpatched bugs, with one already acknowledged with a draft patch.

1 Introduction

Directed Greybox Fuzzing (DGF) aims to search for program inputs that lead the program's execution to reach specific target program states, e.g., an index variable used to access an array is out of the array bounds, thereby uncovering potential bugs or vulnerabilities. It is widely used in software engineering and security applications, including debugging system crashes [81, 85], testing patches [4, 60], verifying bug reports from static analysis [15, 63], and generating Proof-of-Concept (PoC) exploits of vulnerabilities [5, 52]. While directed fuzzing was not originally intended for discovering new bugs, i.e., it needs a specified target, it has found numerous impactful zero-day vulnerabilities [8, 49, 64, 74, 86].

DGF is challenging, as the target program states are often deeply nested in the program, while the search space introduced by the complexity of real-world software is prohibitively large. To speed up the search and schedule promising inputs, most existing techniques leverage the control flow proximity, e.g., the distance to the target location in the control flow graph, or heuristics based on the semantics of the branch predicates [12, 47, 74], e.g., the target state if x==42 has the distance metric of |𝑥 − 42|. However, such feedback is sometimes too sparse or indirect to reliably measure the progress, especially when there is a long chain of implicit preconditions guarding the target program states [1, 25, 33, 40, 41, 62, 69, 76, 100]. For example, triggering CVE-2018-13785 in libpng requires a PNG file to satisfy a precise sequence of preconditions, i.e., valid signature, correct chunk layout, specific IHDR fields (e.g., bit depth, color type), and a magic image width (0x55555555), to trigger an integer overflow [40], while predicates that explicitly check these conditions are largely absent from the code, so it provides little incremental progress guidance.

To capture the intricate feedback to improve search efficiency, more advanced approaches identify progress-capturing constraints in the program to drive execution towards satisfying specific temporal orders and preconditions [1, 3, 26, 33, 38, 43, 47, 49, 61, 65]. However, these constraints are often manually crafted by experts and tailored to a few specific target state types, e.g., focusing on a temporal memory safety bug like a use-after-free by enforcing an allocate–free–use sequence. As different programs can have diverse target states and disparate functionalities, the feedback metric effective in one case may not generalize to another [14].

As Machine Learning (ML) and Large Language Models (LLMs) have demonstrated surprising code reasoning capabilities, there has been a growing interest in extending such capabilities to help guide fuzzing [66]. A common approach is to employ LLMs to directly generate inputs or grammar-aware input generators (i.e., fuzz driver or harness) [10, 37, 52, 53, 58, 77, 96, 101], where the preconditions useful to reach target states are expressed as part of the input grammar constraints. However, not all conditions to reach target states can be easily represented as input grammars. For example, Figure 1 shows that an intermediate state (checked via !found_plte) necessary to trigger the buffer overflow in libpng only emerges in the middle of the execution and cannot be easily checked at the input level. More importantly, such a task formulation is particularly challenging for LLMs, as reasoning from the target program states all the way to the input often requires an extremely long context and convoluted reasoning chain [21, 39, 44, 45, 48, 70, 71], a common
pitfall for LLM hallucination [72, 98], let alone the challenge of verifying the correctness of the LLM-generated inputs and harnesses, which can significantly impede the fuzzing progress if the generated constraints are incorrect [38, 65].

Our approach. We present Locus, a new framework that integrates LLMs’ code reasoning capabilities for directed fuzzing by synthesizing semantically meaningful and verifiable predicates to guide the search. Unlike existing LLM-based approaches that focus on constraining the search space at the input (harness) level, which poses a high reasoning burden on LLMs and can be hard to verify, Locus generalizes the constraint generation to arbitrary program points. Specifically, given a target state to reach, Locus automates the reasoning about the intermediate program states that verifiably relax the target states and synthesizes predicates as a curriculum to capture the gradual progress towards reaching them. Such predicates serve as the preconditions dominating all executions to reach the target states, providing fine-grained progress feedback to guide the fuzzers for input scheduling and early termination [35, 56, 78]. As these predicates are implemented as source-level instrumentation, they are agnostic to any fuzzer implementations and can thus be integrated without any customization. Moreover, as the instrumentation is a one-time offline process, the cost of running Locus can be amortized over all the succeeding fuzzing procedures.

Agentic design. Automating the generation of effective progress-capturing predicates for diverse target states involves nontrivial reasoning across the entire codebase, spanning multiple functions and their associated data and control flows. Simple prompting techniques, such as in-context learning, chain-of-thought (CoT) prompting, retrieval-augmented generation, etc., can hardly support sophisticated reasoning at the repository level. To this end, Locus features an agentic synthesizer-validator workflow, equipped with diverse program analysis tools to support the traversal of the control flow graph, tracking data dependencies, retrieving function calls, and symbolic execution. They serve as the agent’s action space, allowing the LLM to reason via CoT and act by calling them [97] to iteratively propose, refine, and validate candidate predicates.

Importantly, by constraining the output space of the agent at the predicate level, Locus ensures any (inevitable) LLM errors are corrected before deployment for fuzzing. Specifically, Locus’s output is validated by both the compiler and symbolic execution [7]. The former checks the syntactic correctness, while the latter ensures the generated predicates strictly relax the target states, i.e., by checking whether there exists a path that violates the predicates while satisfying the target states. Such a strict relaxation ensures that the fuzzing execution can be safely early-terminated if it violates the predicate. Figure 2 shows Locus’s workflow.

Results. We evaluate Locus on the Magma benchmark [32] with eight widely used libraries and 10 vulnerability types across eight state-of-the-art fuzzers, covering both directed and undirected ones. Locus achieves a significant 70.3× speedup on average for directed fuzzers, with up to 214.2× speedup when integrated to accelerate SelectFuzz [56], one of the state-of-the-art directed fuzzers. For coverage-guided fuzzers, Locus accelerates them by 13× on average, including a 15.3× speedup for an extensively optimized fuzzer like AFL++. So far, Locus has found eight previously unpatched bugs. We have responsibly reported all the bugs, and one has already been acknowledged by the maintainers with a draft patch.

2 Overview

In this section, we first describe the background of directed fuzzing. We then contrast our idea with the existing approaches to demonstrate how Locus complements the existing design (Figure 1).

2.1 Directed Fuzzing

Canary. Directed fuzzing aims to generate inputs that drive the program execution to reach predefined program states. In this paper, we rely on the concept of canaries [32] to explicitly represent the target states, which are considered reached when the corresponding canary condition is satisfied. Early works like AFLGo [6] and Beacon [35] treat canaries primarily as the reachability to particular program points, i.e., specific line numbers in a code file. Recent works [32, 88] adopt assertion-based canaries, where predicates are explicitly inserted to check vulnerability-triggering conditions at runtime. An example of a canary is highlighted in the bottom-left code snippet of Figure 1.

Canaries for directed fuzzing originate from various sources, including static analysis alerts, manually identified vulnerability sites, or runtime sanitizers. For example, address sanitizers [73] can also be approximately viewed as canaries checking memory safety violations, e.g., inserting canary(index > maxbound) to detect out-of-bound accesses.

Common strategies to reach canaries. Previous research on directed fuzzing can be broadly grouped into three main strategies:

• Distance-guided scheduling. This approach approximates the distance from the currently covered regions of the CFG to the canary, and then prioritizes seeds that are more likely to drive execution along paths that reduce this distance [6, 11, 24]. For paths that are unlikely to reach the canary, the execution can be terminated early [35, 56, 78].
• Specialized progress-capturing state representations. Some approaches introduce domain-specific abstractions tailored for certain classes of vulnerabilities [43, 61], for example, tracking dataflow from memory allocation to deallocation events for the use-after-free vulnerability.
• LLM-assisted harness generation. A more recent direction leverages LLMs to synthesize input generators to manipulate raw program inputs [54, 90, 96, 102]. These methods attempt to constrain the valid execution at the input level by leveraging the learned input grammar knowledge in the LLMs.

2.2 Motivating Example

We use a real-world vulnerability CVE-2013-6954 in libpng, a widely used C library for parsing PNG files, to demonstrate how Locus complements existing approaches.

A simplified CFG of this vulnerability is shown in Figure 1a, where all the nodes highlighted in gray are equidistant to the canary node (highlighted in yellow), but only one node PLTE is in the path that can lead to the vulnerability (annotated as green arrows). Based on this, fuzzers can only perform conservative pruning (annotated as dashed arrows) while omitting a large part of the
[Figure 1 panels: (a) a simplified CFG; (b) LLM-generated code (an input generator next to the fuzzing entry point); (c) Locus synthesizes predicates in relevant functions, shown in png_read_info as if (!found_plte) EXIT(); and if (png->color != PALETTE) EXIT();]

Figure 1: A motivating example (CVE-2013-6954) showing how Locus complements existing works. (a) Traditional approaches based on distance to targets in CFG lack fine-grained guidance to distinguish nodes when they have the same distance. (b) LLM-based harness generation is limited to help reach the target. (c) Predicates (as if statements) synthesized by Locus provide extra semantic guidance for DGFs, while relaxing the constraint generation from input-level to arbitrary program points.
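The effect of the predicates in Figure 1c can be mimicked with a small model. The following is a minimal Python sketch (libpng itself is C) of a chunk loop guarded by two early-exit predicates. The chunk representation, `parse`, `reaches_target`, `EarlyExit`, and the toy canary are hypothetical stand-ins chosen for illustration, not libpng code or Locus output.

```python
class EarlyExit(Exception):
    """Raised when an instrumented predicate rules out reaching the canary."""


PALETTE = 3                # PNG color type value for indexed-color images
MAX_PALETTE_ENTRIES = 256  # illustrative bound used by the toy canary


def parse(chunks, color_type, instrumented):
    """Toy model of a chunk loop; returns True when the toy canary fires.

    chunks is a list of (chunk_type, payload_length) pairs.
    """
    found_plte = False
    palette_entries = 0
    for chunk_type, length in chunks:
        if chunk_type == "PLTE":
            found_plte = True
            palette_entries = length // 3  # three bytes per palette entry
        elif chunk_type == "IDAT":
            if instrumented:
                # Synthesized predicates: stop early if the target is out of reach.
                if not found_plte:
                    raise EarlyExit("no PLTE chunk before IDAT")
                if color_type != PALETTE:
                    raise EarlyExit("image is not palette-based")
            break  # image data starts, so header parsing ends
    # Toy canary: an oversized palette in a palette-based image.
    return found_plte and color_type == PALETTE and palette_entries > MAX_PALETTE_ENTRIES


def reaches_target(chunks, color_type, instrumented):
    """Run the toy parser and treat an early exit as 'target not reached'."""
    try:
        return parse(chunks, color_type, instrumented)
    except EarlyExit:
        return False


SAMPLES = [
    ([("IHDR", 13), ("IDAT", 64)], 2),                 # truecolor, no palette
    ([("IHDR", 13), ("PLTE", 48), ("IDAT", 64)], 3),   # small palette
    ([("IHDR", 13), ("PLTE", 900), ("IDAT", 64)], 3),  # oversized palette
    ([("IHDR", 13), ("PLTE", 900), ("IDAT", 64)], 2),  # palette chunk, wrong color type
]

for chunks, color_type in SAMPLES:
    original = reaches_target(chunks, color_type, instrumented=False)
    guarded = reaches_target(chunks, color_type, instrumented=True)
    assert original == guarded  # the guards never hide a reachable target here
```

On these four samples, the guarded run stops early on the inputs that cannot reach the toy canary and agrees with the unguarded run on whether the target is reached. A handful of samples only illustrates the property; it does not establish it for all inputs.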
paths that are irrelevant to the vulnerability, e.g., all solid black arrows. Capturing the progress towards reaching this canary requires specialized knowledge about libpng parsing logic. However, developing such specialized progress-capturing state representations requires manual efforts from experts and cannot easily generalize to different vulnerabilities and programs.

Figure 1b shows an example of the input generator synthesized by LLM-based approaches that enforces some specific input constraints. However, this approach is inherently limited as program inputs can have sophisticated structures that cannot be easily enforced at the input level, e.g., the PNG file contains a compressed data chunk IDAT. Therefore, the constraints posed on the input generated by the LLM often reduce to generic input grammars, e.g., png_sig_cmp only trivially ensures the input is a valid PNG file. We also note that the synthesized generators cannot be easily checked to determine whether they can effectively help reach the canary. For example, png_write_frame_head constrains the generated input to a special PNG type, APNG, from which it is essentially impossible to reach the canary, but automatically checking this fact is challenging. Moreover, to constrain the input to satisfy certain properties necessary to reach the target, the generator sometimes needs to repeat the input processing logic present in the execution path. Such repeated execution introduces additional overhead.

Locus synthesizes progress-capturing predicates at arbitrary program points to provide more fine-grained guidance and complement the above approaches. Specifically, Locus generates a predicate in png_read_info (shown in Figure 1c), representing the precondition to trigger a real-world buffer overflow vulnerability, and terminates the execution if it is not satisfied (if(!found_plte)). The target state is highlighted in a canary statement. Locus’s trajectory reveals that it relies on the semantic reasoning of libpng’s parsing behaviors for generating this semantically meaningful predicate.

Specifically, PNG files consist of a series of structured chunks, many of which are optional and can appear in varying orders. One such optional chunk, PLTE, stores the palette data used for indexed-color images and must adhere to strict size constraints. In this example, an input PNG file can trigger a buffer overflow in the png_set_PLTE function only when the PNG file contains a PLTE chunk, which serves as the necessary state before the target state can be reached, i.e., the palette size exceeds the expected bounds.

Existing fuzzers struggle with such a vulnerability due to the complex parsing logic. Since PNG chunks may appear in arbitrary order and many are optional, the primary parsing routine implements a loop in png_read_info to iteratively process each chunk. Within this loop, multiple branches exist, and each is responsible for handling a specific chunk type, e.g., IHDR, IDAT, PLTE. This loop makes it possible to identify whether a given PNG file contains a PLTE chunk. However, since these branches are syntactically parallel and executed without a fixed order, they all appear equidistant from the vulnerability site (png_set_PLTE) in the control flow graph. As a result, the naive path distances cannot distinguish between them to prioritize one path over the other. This explains why, in Table 2 and Table 3, all existing fuzzers take a substantial amount of time to reach this target state in libpng.

Locus starts with generating a predicate to check whether the input PNG file contains a PLTE chunk at the caller of the canary function (①). This implies that if the PNG file does not contain a PLTE chunk, the execution can terminate, as it is impossible to reach the target state. While this predicate is semantically correct, i.e., the symbolic execution confirms there is no feasible path that satisfies the target state while violating the generated predicate (Section 3.4), and can help filter out non-palette-based PNG files, it is almost redundant as there is an existing check immediately after the generated one (if (color_type & MASK_PALETTE)), and thus the generated predicate could barely help the fuzzer.

To address this issue, Locus includes another refinement iteration (Algorithm 1) to propagate this predicate to a location closer to the program entry such that the infeasible execution can be terminated earlier. By traversing the call graph and retrieving the functions along the call chain, the refinement proposes a new location in function png_read_info right after the parsing loop (②). Such a propagated predicate will also be validated again to ensure it remains a strict precondition to reach the target states. At the next iteration of refinement, Locus proposes to propagate this predicate further inside the branch that parses the IDAT chunk (③). That is because the specification of the PNG file requires that the optional
PLTE chunk must appear before the IDAT chunk. Therefore, by the time the parsing procedure reaches the IDAT chunk, we can already determine whether the input PNG file contains a PLTE chunk.

Such a finalized predicate benefits all fuzzers, as it can prioritize inputs with PLTE properties early in the execution and save the fuzzers from considering PLTE-irrelevant PNG images. With such a predicate, AFLGo [6] gained an impressive 8× speedup in triggering this vulnerability with only a three-line change in png_read_info.

3 Methodology

Figure 2 illustrates the high-level workflow of Locus. Given a specified canary 𝜓 and the program codebase 𝑃, Locus outputs a new codebase 𝑃′ instrumented by a set of predicates Φ, where each 𝜙 ∈ Φ is represented by a branch condition with early exit if the predicate condition is not satisfied. The synthesizer is responsible for generating candidate predicates Φ, while the validator ensures that Φ are both syntactically valid and semantically consistent with 𝜓, i.e., relaxing the canary conditions. The fuzzer will run on 𝑃′ to receive more progress feedback to the canary while enjoying the early termination. In the following, we formalize the task and then elaborate on each design component in Locus.

[Figure 2 diagram: the program codebase and the canary pass through a synthesis stage (localization, generation, refinement) and a validation stage; diagram labels include Crashing Input, Predicates, Validation, Control Flow Graph, Pruned Control Flow Graph, and Build Report.]

Figure 2: Overview of Locus workflow. Locus takes as inputs the program codebase 𝑃 and the canary 𝜓, and produces a program 𝑃′ instrumented with the progress-capturing predicates. The predicate branches provide extra coverage feedback and guards (via early termination) to guide the fuzzer toward reaching the target state, i.e., canary 𝜓, more efficiently.

3.1 Formalization

We use the notation 𝑃 ⇓_𝑥 𝑆 to indicate that the program 𝑃, when executed with input 𝑥 ∈ 𝑋 sampled from 𝑃’s input space 𝑋, can reach a set of program states 𝑆. Assume triggering a vulnerability 𝑣 can be characterized as reaching a set of program states 𝑆_𝑣. Given the program under fuzz (𝑃′) instrumented by Φ, we need to make sure that the early termination introduced in Φ preserves the same fuzzing behavior on 𝑃′ as that of 𝑃, i.e., Φ does not reject any 𝑥 ∈ 𝑋 that would have reached 𝑆_𝑣 in 𝑃. To this end, we formally define fuzzing admissibility.

Definition 1. For a vulnerability 𝑣, the program 𝑃′ is fuzzing admissible to 𝑃, iff ∀𝑥 ∈ 𝑋, 𝑃′ ⇓_𝑥 𝑆_𝑣 ⟹ 𝑃 ⇓_𝑥 𝑆_𝑣.

While it is generally infeasible to ensure that 𝑃′ is fuzzing admissible to 𝑃 without a pre-defined target vulnerability 𝑣, we show such a property is well-defined when 𝑣 is given and explicitly represented by the canary 𝜓.

A program predicate 𝜙 : 𝑠 → {True, False} is a boolean mapping over the program state space. Concretely, any conditions inside the branch statements, e.g., if, while, or assert, can be regarded as predicates, as they evaluate a Boolean expression over the program state. We define a special class of predicates, namely canaries, to characterize vulnerable states:

Definition 2. A vulnerability canary 𝜓 is a predicate s.t. ∀𝑠, 𝑠 ∈ 𝑆_𝑣 ⇔ 𝜓(𝑠) = True.

The goal of directed fuzzing towards 𝑆_𝑣 is equivalent to finding inputs that satisfy the canary 𝜓. To provide semantically meaningful guides to directed fuzzing, we may instrument the program 𝑃 with an additional predicate 𝜙. Such instrumentation is admissible if and only if 𝜙 is a relaxation of the true vulnerability canary 𝜓:

Definition 3. A predicate 𝜙 is the relaxation of canary 𝜓, if ∀𝑠 ∈ 𝑆_𝑣, 𝜓(𝑠) = True ⟹ 𝜙(𝑠) = True.

By definition, 𝑃′ instrumented by Φ is fuzzing admissible if every predicate 𝜙 is a relaxation of 𝜓. This suggests that fuzzing 𝑃′ is equivalent to fuzzing 𝑃 while enjoying the additional guidance and early termination introduced by the instrumented predicates.

Theorem 1. The instrumented program 𝑃′ is fuzzing admissible to 𝑃, if 𝑃′ is instrumented with Φ, where every 𝜙 ∈ Φ is the relaxation of 𝜓.

We next illustrate how our agentic synthesizer-validator workflow can produce an admissible instrumented program 𝑃′ which
                                                    Table 1: Toolset for synthesizer agent
helps provide rich semantic feedback for the directed fuzzer to reach           Algorithm 1 Scaffold of Locus’s agentic workflow
vulnerable states 𝑆 𝑣 .                                                         Require: original program 𝑃, vulnerability canary 𝜓
                                                                                Ensure: an target-conditional equivalent program 𝑃 ′
3.2     Agent Toolset                                                            1: C ← CanaryReasoning(𝑃,𝜓 )              ⊲ list of reasoning tokens
Existing agentic practices for software analyses, such as bug fixing             2: Φ ← ∅ d
or fault localization, are often equipped with lightweight command- 3: for all 𝑐 ∈ C do
line tools to perform local reasoning [67, 95, 105]. This is effective           4:     𝑙 ← Localize(𝑐, 𝑃)          ⊲ find the initial program point
because the root cause of a bug is often spatially close to the observ-          5:     𝑛←0
able failure, allowing the agent to retrieve relevant context from               6:     repeat
a narrow portion of the entire codebase. However, in our task for-               7:         𝜙𝑙 ← Generate(𝑙, 𝑐, 𝑃)
mulation, predicates can be synthesized at any program point. As a               8:         while ¬validate(𝜙𝑙 ,𝜓, 𝑃) do ⊲ syntax and semantic
result, local reasoning is insufficient, and the agent needs additional          9:             𝜙𝑙 ← Generate(𝑙, 𝜙, 𝑃)
tools to support analysis of a broader code context.                            10:         end while
    To equip the agent with long-context code reasoning at arbitrary            11:         𝑙 ← Localize(𝜙, 𝑙, 𝑐, 𝑃) ⊲ refine, find a better location
program points, we provide a suite of tools that Locus can invoke               12:         𝑛 ←𝑛+1
to navigate the entire codebase and retrieve relevant code snippets.            13:     until 𝑙 = None ∨ 𝑛 > MaxIterations
Particularly, in addition to common tools like code search and file             14:     Φ ← Φ ∪ {𝜙𝑙 }
listing, Locus integrates specialized tools for traversing program              15: end for
graphs, including call graphs and reference graphs. Through the                 16: 𝑃 ′ ← Instrument(𝑃, Φ)            ⊲ fuzzing admissible program
call graph APIs, the synthesizer agent can retrieve function call
relationships and reason about the interprocedural control flow, while the
reference graph API allows it to identify variable usages, pointer
dereferences, and data access patterns across the codebase. Table 1
shows the complete toolset and the ratio of their actual usage in
our experiments. Among them, graph traversal accounts for over a
quarter of all API invocations.
    It is important to note that these graphs are only provided as supplementary
references to Locus, as they are all derived statically and
may miss program behaviors such as dynamic dispatches and indirect
calls. We observe that the LLM used in Locus is capable of leveraging
its learned knowledge to bridge this gap. For example, in the
case of vulnerability TIF002 (see Table 2 and Table 3), Locus successfully
resolved the indirect function pointer tif->tif_decoderow
to its concrete implementation PixarLogDecode.

3.3     Synthesis
Algorithm 1 elaborates on the workflow of Locus. For each candidate
predicate, it iteratively localizes and generates the predicates
(lines 4–13), which are then refined and validated for both syntax
and semantics (line 8). Once validated, the predicates are instrumented
into the original program to produce a fuzzing-admissible program
(line 16).
    As we demonstrate in Theorem 1, a predicate 𝜙 should be instrumented
on an execution path that can reach the canary 𝜓. To
synthesize 𝜙, LLMs must reason about the root cause of the vulnerability
and approximate potential execution traces of the program
that can trigger it. The execution trace of a program is often long
and complex, involving multiple functions and files. Therefore, a
naive one-shot synthesis approach may not be sufficient, as the
synthesizer only retrieves limited context and proposes primitive
predicates, e.g., an initial predicate generated by Locus (2 in Figure 1c).
To address this, Locus employs an iterative localization-generation
refinement workflow. In each iteration, the synthesizer
generates a predicate that preserves the same semantic meaning
while moving its placement closer to the program entry.
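The loop structure of Algorithm 1 can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the callables `localize`, `generate`, `validate`, and `refine`, the `Predicate` class, and the toy state model are all hypothetical names introduced here. The helper `is_relaxation` enumerates a small list of states and is merely a stand-in for the validator described in Section 3.4.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

State = Dict[str, int]  # hypothetical stand-in for a program state


@dataclass
class Predicate:
    location: str                   # program point where the check is inserted
    check: Callable[[State], bool]  # False means "terminate this execution early"


def is_relaxation(phi, psi, states) -> bool:
    """phi relaxes psi if no state satisfies psi while phi rejects it."""
    return not any(psi(s) and not phi(s) for s in states)


def synthesize(reasoning, localize, generate, validate, refine, max_iterations=3):
    """Loop structure of Algorithm 1, lines 2-15; all callables are hypothetical."""
    predicates: List[Predicate] = []
    for c in reasoning:
        loc: Optional[str] = localize(c)
        n = 0
        while True:
            phi = generate(loc, c, None)
            while not validate(phi):
                phi = generate(loc, c, phi)  # retry, passing the rejected predicate
            loc = refine(phi, loc, c)        # None when no better location exists
            n += 1
            if loc is None or n > max_iterations:
                break
        predicates.append(phi)               # last validated predicate for c
    return predicates


# Toy run: the canary fires when x == 42; states are enumerated exhaustively.
states = [{"x": v} for v in range(100)]
psi = lambda s: s["x"] == 42


def generate(loc, c, failed):
    # The first attempt is too strict (it rejects x == 42); the retry is a relaxation.
    if failed is None and loc == "parse":
        return Predicate(loc, lambda s: s["x"] > 50)
    return Predicate(loc, lambda s: s["x"] > 0)


def refine(phi, loc, c):
    # Pretend the check can be hoisted once, from "parse" to "main".
    return "main" if loc == "parse" else None


result = synthesize(
    reasoning=["x must be positive"],
    localize=lambda c: "parse",
    generate=generate,
    validate=lambda phi: is_relaxation(phi.check, psi, states),
    refine=refine,
)
print([p.location for p in result])
```

In this toy run the first candidate is rejected because it would terminate the execution that reaches the canary, the regenerated candidate passes, and the refinement step then hoists the check one level toward the program entry. A practical loop would likely also bound the number of regeneration attempts, which the sketch omits.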
   Locus first synthesizes an initial set of predicates by analyzing
the canary and approximates a list of semantic characteristics of
the inputs that are likely related to this canary, e.g., data structures,
types, and properties. The synthesizer is then prompted to
consider these constraints from multiple dimensions. For example,
in Figure 1, besides the predicate, the synthesizer generates multiple
constraints, such as requiring that the input PNG file must
contain a valid signature. These approximations are progressively
concretized and refined as the synthesizer retrieves more relevant
code and reasons about the program execution.
   For each approximated characteristic, the synthesizer needs to
identify an appropriate program point 𝑙 where the predicate can
be expressed in terms of the variables and expressions in scope.
Requiring the LLM to directly identify the exact program point is
often too challenging and unreliable, so the initial stage only asks
the synthesizer to select a candidate function rather than a precise
program point. Given the reasoning context generated by the LLM
so far, the synthesizer attempts to generate an initial predicate
𝜙_𝑙 in the candidate function. Once 𝜙_𝑙 passes the validation stage
(Section 3.4), we refine it in the following iterations. The goal of
such a refinement is to identify an earlier program point where the
same semantic check can be performed. This allows the program
to reject invalid inputs sooner in the execution, enabling the fuzzer
to explore more valid mutations within the same time budget.
   It is worth noting that with the refinement iteration, a predicate
can be refined all the way up to the fuzzing harness. In some cases,
Locus can indeed effectively synthesize predicates at the program
entry, making it similar in spirit to automated harness generation
works such as HGFuzzer [90] and InputBlaster [54]. However, in
most cases, the input constraints cannot be directly accessed at the
harness level. For example, in Section 2.2, verifying the presence
and content of a PLTE chunk requires parsing internal structures
of the input that are only accessible deeper in the program. This
necessitates placing predicates at intermediate program points.

3.4    Validation
The validation step is critical to preserving the correctness of the
instrumented program and ensuring that the inserted predicate 𝜙_𝑙
maintains the fuzzing admissibility. As shown in Algorithm 1 lines
7 to 9, the validator takes as input the candidate predicate 𝜙, the
vulnerability canary 𝜓, and the program 𝑃, and validates whether
the predicate is both syntactically and semantically correct. If the
predicate passes both checks, Locus will refine the predicate by
exploring potentially better program points closer to the program
entries (Section 3.3). If it fails, Locus will self-reflect and attempt to
regenerate the predicate, using diagnostic feedback collected from
the validator.

Syntax validation The first component of validation is syntactic
checking. A predicate that fails to compile cannot be used in a
fuzzing campaign, regardless of its intended semantics. To verify
this, Locus instruments the program with the predicate 𝜙 at the designated
program point 𝑙 and invokes the project's build system using
a predefined command. If the build fails, the associated compiler
error messages are sent back to the synthesizer for repair. This diagnostic
information typically involves undeclared symbols, type
mismatches, or malformed expressions.

Semantic validation The second component is semantic validation,
which confirms that the predicate 𝜙 strictly relaxes 𝜓. It
ensures that the predicate does not reject any execution paths that
could reach the vulnerability (Theorem 1). As we demonstrated
in Definition 3, such validation requires enumerating all possible
program inputs, which cannot be done within an acceptable time
budget. Therefore, we utilize symbolic execution to find counterexamples
of the relaxation.

   Theorem 2. A predicate 𝜙 is not a relaxation of 𝜙′, if there exists
𝑃 ⇓_𝑥 𝑠, s.t. 𝜙(𝑠) = False ∧ 𝜙′(𝑠) = True.

   Specifically, the predicate is not a strict relaxation of the canary
if the symbolic execution can find a path that satisfies the negated
predicate ¬𝜙 while the canary 𝜓 still evaluates to true.
   It is natural to employ symbolic execution as a formal checker for
Theorem 2 to ensure fuzzing admissibility. However, it often incurs
substantial overhead by exploring irrelevant execution paths [7],
e.g., branches that are unrelated to either the synthesized predicate
or the target canary. This excessive path exploration can lead to
prohibitively long validation time, incurring additional overhead
of deploying Locus. To mitigate this inefficiency, we adopt a strategy
inspired by Chopper [82] that skips specified parts of the code
and targets only the exploration of paths according to our selection.
Specifically, we perform a lightweight reachability analysis
on the CFG and prune nodes that are not reachable from either the
predicate or the canary. Locus then initiates symbolic execution to
explore the path between ¬𝜙 and 𝜓.

4     Evaluation
We aim to answer the following research questions:
 RQ1 Effectiveness: How effective is Locus in accelerating the
     generation of PoC inputs for given target vulnerabilities?
 RQ2 Cost and performance: What is the time cost and token
     cost for deploying Locus?
 RQ3 Ablations: How do the individual components of Locus
     contribute to its overall performance?
 RQ4 Security impact: How can Locus assist in real-world vulnerability
     detection scenarios?

4.1    Setup
Dataset We evaluate Locus on the Magma fuzzing benchmark [32],
which includes a diverse set of real-world vulnerabilities selected
from nine popular open-source software projects. We also consider
popular libraries in the wild, e.g., libarchive, to evaluate the capabilities
of Locus in finding real-world vulnerabilities (see Section 4.6).

Baselines We select eight fuzzers (see Table 4) as our baseline,
covering both the state-of-the-art directed and coverage-guided
ones. For each fuzzer, we use the latest stable version available at
the time of writing.
   For directed fuzzers, we consider AFLGo [6], SelectFuzz [56],
Beacon [35], and Titan [36]. We do not include a comprehensive
set of directed fuzzers due to the lack of publicly available implementations
and the challenges of integrating them into Magma.
Table 2: TTE for each vulnerability in the Magma benchmark across different directed fuzzers. T.O. indicates that the fuzzer
cannot find the vulnerability within 24 hours. ∅ indicates that the fuzzer either could not compile the target program or the
preprocessing step exceeded 24 hours.
    We also consider coverage-guided fuzzers, including the widely
used AFL [99] and AFL++ [27]. To make the evaluation more comprehensive,
we include two additional fuzzers. MOPT [57] enhances
fuzzing with optimized mutation strategies, and Fox [75] optimizes
the seed scheduling as online stochastic control.
    Note that Locus is not a standalone fuzzer, but focuses exclusively
on code transformations for the target software. Therefore,
it is agnostic to the fuzzer implementations and can complement
any fuzzer (Figure 2).

Metrics We follow the existing directed fuzzing approaches by
adopting the Time To Exposure (TTE) to measure the performance
of the baseline fuzzers and the improvement introduced by integrating
Locus. TTE measures the time taken by a fuzzer to find the
input that satisfies the canary condition. To mitigate the inherent
randomness introduced by fuzzing, we follow [32] by executing 10
independent fuzzing trials per vulnerability sample and report the
average TTE. We also employ the Mann-Whitney U Test [59] to
demonstrate the statistical significance (𝑝-value) of the results.

Implementations We run all the fuzzers with the same initial
seed inputs provided in Magma to ensure a fair comparison. Each
fuzzing trial is capped at 24 hours, so those that fail to find the
triggering input are recorded at this maximum duration. This may
underestimate the actual TTE of the baseline fuzzers, so the recorded
value is a lower bound. Therefore, the actual improvement
brought by Locus can be even larger. To eliminate hardware and
system-related discrepancies, all experiments are conducted on a
dedicated cluster, where each server comes with an Intel Xeon Gold
6126 CPU and 128GB of RAM running Ubuntu 20.04.
   We implement Locus using the PydanticAI framework [16]. The
synthesizer's tooling is primarily built on top of Multiplier [29].
We leverage the SVF static analysis framework [80] to perform a
lightweight reachability analysis and use KLEE [7] as the symbolic
execution engine to perform semantic validation. We set the LLM
generation temperature to zero to avoid non-determinism. We use
o3-mini-2025-01-31 as the default model with the reasoning level
set to medium, but we also evaluate other LLMs in Section 4.4.

4.2   RQ1: Vulnerabilities Reproduction
We apply Locus on both directed fuzzers and coverage-guided
fuzzers, and compare the performance w/ and w/o Locus. The
results are shown in Table 2 and Table 3.
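The TTE aggregation described under Metrics and Implementations can be illustrated with a short sketch. The trial values below are invented for illustration and are not taken from the paper. The function names are hypothetical, and only the U statistic is computed here: a statistics library such as SciPy (`scipy.stats.mannwhitneyu`) would be needed to obtain the 𝑝-value.

```python
from statistics import mean

TIMEOUT = 24 * 60 * 60  # seconds; a trial that never triggers the canary is recorded at this cap


def average_tte(trials):
    """Mean TTE over trials; None marks a trial that timed out."""
    return mean([TIMEOUT if t is None else min(t, TIMEOUT) for t in trials])


def mann_whitney_u(xs, ys):
    """U statistic: number of (x, y) pairs with x < y, ties counted as 0.5."""
    return sum(1.0 if x < y else 0.5 if x == y else 0.0 for x in xs for y in ys)


# Hypothetical TTEs (seconds) for 10 trials of one vulnerability; not from the paper.
baseline = [5000, 7200, None, 6100, 8000, None, 4300, 9100, 7700, 6600]
with_locus = [300, 450, 280, 900, 610, 350, 720, 400, 510, 330]

base_avg = average_tte(baseline)
locus_avg = average_tte(with_locus)
capped = [TIMEOUT if t is None else t for t in baseline]
u_stat = mann_whitney_u(with_locus, capped)

print(f"baseline average TTE: {base_avg} s")
print(f"with Locus average TTE: {locus_avg} s")
print(f"speedup: {base_avg / locus_avg:.1f}x, U = {u_stat}")
```

Because timed-out trials are recorded at the 24-hour cap, the baseline average can only be too small, which is why the resulting speedup is a conservative estimate.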
Table 3: TTE for each vulnerability in the Magma benchmark across different coverage-guided fuzzers. T.O. indicates that the
fuzzer cannot find the vulnerability within 24 hours. ∅ indicates that the fuzzer either could not compile the target program or
the preprocessing step exceeded 24 hours.
Table 4: Overview of selected fuzzers in the evaluation

 Fuzzer             Category           Description
 AFLGo [6]          directed           Distance-based seed scheduling
 SelectFuzz [56]    directed           Selective path exploration
 Beacon [35]        directed           Fuzzer with efficient path pruning
 Titan [36]         directed           Target correlation inference
 AFL [99]           coverage-guided    Evolutionary mutation strategies
 AFL++ [27]         coverage-guided    Community-enhanced AFL
 MOPT [57]          coverage-guided    Fuzzer with Swarm Optimization
 Fox [75]           coverage-guided    Online stochastic control

   In summary, Locus consistently helps all kinds of fuzzers for
vulnerability reproduction. When integrating Locus to directed
fuzzers, AFLGo [6], SelectFuzz [56], Beacon [35], and Titan [36]
achieve 17.0×, 214.2×, 22.1×, and 28.0× faster TTE on average, respectively, with
five more vulnerabilities found for AFLGo and one more vulnerability
found for SelectFuzz than those without Locus within the
fixed 24-hour time window.
   For coverage-guided fuzzers, integrating Locus also yields substantial
gains: AFL++ [27], AFL [99], MOPT [57], and Fox [75]
achieve 15.3×, 20.4×, 5.5×, and 10.6× faster TTE on average, respectively, with
four more vulnerabilities found for AFL than those without Locus
within the 24-hour time window.

4.3   RQ2: Cost Analysis
We measure the time cost introduced by Locus and the token cost
incurred by LLM inference.
Time cost In Section 4.2, we focused on measuring TTE, i.e., the
time required to trigger the target canary during fuzzing. However,
only focusing on the fuzzing phase can be misleading when evaluating
the overall effectiveness of a directed fuzzer. Many directed
fuzzers perform expensive static analysis on the target program
before starting fuzzing [6, 36, 37]. Likewise, Locus requires additional
preprocessing steps, including codebase indexing, symbolic
validation of predicates, and agentic predicate synthesis. While
such preprocessing costs are arguably a one-time effort that
can be amortized during fuzzing, in practice, we observe that some
approaches incur excessive analysis time, sometimes even longer
than the time required to discover the bug by the fuzzer.
   To provide a more comprehensive evaluation, we report the detailed
preprocessing time overhead incurred by Locus and baseline
fuzzers. As shown in Table 5, the overhead of the baselines grows
steeply with the size and complexity of the codebase. In
contrast, Locus relies only on lightweight analysis tools, e.g., code
retrieval and graph traversal, so it remains efficient regardless of
the project size. In particular, Locus outperforms the best baseline
fuzzer SelectFuzz by 4.5× when evaluated on the largest program
in the Magma benchmark (PHP).

Table 5: Average deploy cost for Locus and directed fuzzers.
All times are in seconds. T.O. indicates that the
fuzzer failed to instrument the target library within 24 hours.

 Target        PNG    SND    TIF    LUA    XML    SSL     PHP    SQL
 Size (LoC)    95k    83k    95k    21k    320k   630k    1.6M   387k
 Index         11     34     82     9      76     146     244    137
 Synthesis     373    331    212    178    215    384     412    349
 Validation    261    133    231    280    475    824     353    407
 Total         645    498    525    467    766    1354    1009   893
 #Tokens (k)   309    303    256    176    653    598     894    467
 AFLGo         122    673    2689   85     5608   24799   T.O.   15630
 SelectFuzz    84     199    1167   44     807    2597    4554   383
 Beacon        64     113    171    35     1656   T.O.    T.O.   3721
 Titan         96     186    967    49     2936   T.O.    T.O.   4965

Table 6: TTEs by ablating different designs and models.

                PNG007   SND001   TIF012   TIF014   SQL018
 Ablate different designs
 AFL++
  origin        53104    451      1731     2555     10355
  +base         54830    389      1904     1226     11573
  +refine       T.O.     23       4417     33931    54268
  +valid        41101    19       1122     682      4961
 SelectFuzz
  origin        72078    7764     9118     58851    53397
  +base         79348    2640     8938     12630    48059
  +refine       T.O.     6        10805    T.O.     T.O.
  +valid        8537     5        4787     7833     6748
 Ablate different LLMs
 AFL++
  origin        53104    451      1731     2555     10355
  w/ o3-mini    41101    19       1122     682      4961
  w/ DR1        45104    53       992      1439     5433
  w/ Gemini     38215    130      1517     2454     5274
 SelectFuzz
  origin        72078    7764     9118     58851    53397
  w/ o3-mini    8537     5        4787     7833     6748
  w/ R1         23699    6        3562     3518     7023
  w/ Gemini     7020     12       7206     40332    3921
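The figures in Table 5 can be cross-checked with a few lines of arithmetic. The sketch below simply copies the relevant rows of Table 5 into Python lists (the variable names are chosen here for illustration) and recomputes the Total row, the PHP comparison against SelectFuzz, and the mean of the #Tokens row.

```python
# Rows copied from Table 5; times in seconds, tokens in thousands.
targets = ["PNG", "SND", "TIF", "LUA", "XML", "SSL", "PHP", "SQL"]
index = [11, 34, 82, 9, 76, 146, 244, 137]
synthesis = [373, 331, 212, 178, 215, 384, 412, 349]
validation = [261, 133, 231, 280, 475, 824, 353, 407]
tokens_k = [309, 303, 256, 176, 653, 598, 894, 467]
selectfuzz = [84, 199, 1167, 44, 807, 2597, 4554, 383]

# Total preprocessing time of Locus = indexing + synthesis + validation.
totals = [i + s + v for i, s, v in zip(index, synthesis, validation)]

# Ratio of SelectFuzz's preprocessing time to Locus's on the largest target (PHP).
php = targets.index("PHP")
ratio = selectfuzz[php] / totals[php]

# Mean token usage across the eight targets.
avg_tokens = sum(tokens_k) / len(tokens_k)

print(dict(zip(targets, totals)))
print(f"PHP: SelectFuzz / Locus = {ratio:.1f}x")
print(f"average tokens: {avg_tokens:.0f}k")
```

The recomputed totals match the Total row of the table, and the PHP ratio rounds to the 4.5× quoted in the text.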
LLM token cost We assess the feasibility of deploying Locus in
practice from a financial perspective, focusing on the monetary cost
of LLM token usage. Table 5 shows the token cost for our Locus
workflow. Locus takes 457k tokens (equivalent to 0.72 USD) to
generate predicates for one sample in the Magma benchmark on
average. The monetary token costs show the potential for Locus
to act as an affordable step in assisting fuzzing.

4.4    RQ3: Ablations
We ablate the design choice of different modules in Locus and
study how they generalize to various LLMs in Table 6. We pick five
representative vulnerabilities that fall under different vulnerability
categories in Common Weakness Enumerations (CWEs). The
evaluation compares two representative mainstream fuzzing tools,
AFL++ from the coverage-guided fuzzers and SelectFuzz from the
directed fuzzers.
Effectiveness of each design component We ablate the individual
design and measure the average TTE obtained by the resulting
predicates. We begin with the base setting, where we only keep
the initial synthesized predicate (Algorithm 1 line 7). Next, we apply
the refinement strategy, allowing the synthesized predicate
to propagate to a better program point without validating its semantic
correctness. Finally, we evaluate the full pipeline of Locus
including both the syntax and semantic validation.
   Table 6 shows that the design choices in Locus consistently
improve the performance. Specifically, we observe that while predicates
generated under the base setting may occasionally accelerate
fuzzing, they are not consistently beneficial. Without validation,
Locus sometimes inserts incorrect predicates into the program, as
evidenced by the cases PNG007, TIF014, and SQL018.
Varying models Besides the default o3-mini model, we consider
DeepSeek R1 [18] and Gemini 2.0 Flash [30] to study the generality
of our agentic framework to different LLM architectures. Table 6
shows that Locus generalizes to different LLM architectures, except
for one case where Gemini fails to bring significant improvement
to TIF014. We investigate this case and find that Gemini failed to
elevate the generated predicate to the caller closer to the input,
such that the additional overhead introduced by evaluating the
predicates outweighs the benefit it brings to the fuzzing.

4.5    RQ4: Detecting New Vulnerabilities
We integrate Locus into a real-world vulnerability detection workflow
and apply the pipeline to a set of well-fuzzed targets, such as
VLC [83], libarchive [50], and libming [51]. To construct meaningful
fuzzing targets, we generate new canaries through two primary
approaches. First, we leverage alerts produced by the static analysis
tool SVF [80], which identify potentially vulnerable program
points based on memory access patterns, aliasing behavior, or use-after-free
risks. Second, we derive reachability canaries for program
points associated with previously patched bugs. The rationale behind
this strategy is that many real-world bugs occur in clusters or
evolve from incomplete fixes [87].
Table 7: New vulnerabilities detected by fuzzers with Locus. For
each newly found vulnerability, we run AFL++ and SelectFuzz
w/ and w/o Locus. We exclude the detailed software

[Figure residue: panels "Fixing Patch" and "Canary by LOCUS", showing a patch to src/expr.c that involves pCollName; listing not recoverable]
[Figure residue: code listing fragment checking FORMAT_RAR_V5 / FORMAT_RAR and the delta filter; listing not recoverable]

enables the LLM to operate within the (relatively) local context,
allowing Locus to detect certain code behaviors that only emerge
midway through execution to inform subsequent input searches,

verification [9, 13, 42, 55, 92]. Locus resembles these approaches in
the sense that they also integrate LLMs with rule-based verifiers,
such as theorem provers or SMT solvers, to check the validity of