[ICML2026] Towards Feedback-to-Plan Decisions for Self-Evolving LLM Agents in CUDA Kernel Generation

CUDAnalyst (CUDA + Analyst) is an analysis framework for studying how self-evolving LLM agents make planning decisions during CUDA kernel generation. Based on CUDAnalyst, we perform generation-level feedback interventions in self-evolving LLM agents for CUDA kernels, attributing contributions of feedback signals to plan decisions.

Motivation

Rather than treating agent execution as a monolithic end-to-end process, where trajectory drift makes controlled attribution difficult, CUDAnalyst decouples feedback acquisition from planning generation, enabling controlled intervention and generation-level attribution across iterative optimization trajectories.

Introduction

The framework supports principled evaluation of heterogeneous feedback signals, including correctness (Debugger), static analysis (Analyzer), and runtime performance (Profiler), and measures how each signal substrate influences subsequent planning behavior (PlanAgent).

To support scalable experimentation, CUDAnalyst adopts a sample-centric, event-driven execution architecture with concurrent pipelined evaluation, namely IntervenePipe.

Installation

git clone https://github.com/yuxuan-z19/cudanalyst.git
cd cudanalyst
# install PyTorch with appropriate CUDA backend
uv pip install torch --torch-backend=cu124
# for simple usage
uv pip install -e .
# for development
uv sync --dev

Ensure you have installed Nsight Compute >= 2025.2.1 and added the absolute path of its extras/python directory to the $PYTHONPATH environment variable.

❯ export PYTHONPATH="/data/zyx/local/nsight-compute-2025.2.1/extras/python:$PYTHONPATH"
❯ python -c "import ncu_report;"
❯ echo $?
0
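
The same check can be scripted, for example at the top of an evaluator, so that a missing Nsight Compute Python module fails early with a readable message. The following is a minimal sketch using only the standard library; the helper name `has_module` is hypothetical and not part of CUDAnalyst:

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if a top-level module `name` can be located on sys.path."""
    return importlib.util.find_spec(name) is not None


def main() -> None:
    # ncu_report ships with Nsight Compute under extras/python.
    if has_module("ncu_report"):
        print("ncu_report found")
    else:
        print("ncu_report not found: check that extras/python is on PYTHONPATH")


if __name__ == "__main__":
    main()
```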

Usage

This project supports three main workflows:

OpenEvolve-based evolution
LLM4AD-based evolution
Generation-level Intervention (multi-rollout intervention on evolution trajectories)

The recommended order is: Run evolution $\rightarrow$ Group by generation $\rightarrow$ Perform intervention $\rightarrow$ Compute statistics

OpenEvolve evolution

Install the modified OpenEvolve fork:

git submodule update --init --recursive
cd ./eval/openevolve && uv pip install -e .

Example using the NPB-GPU CG workload is provided in ./eval/demo/oe-npb.bash:

SUITE_ROOT="benchmark/hpc/npb"
WORKLOAD="CG"
WORKLOAD_DIR=$SUITE_ROOT/src/$WORKLOAD

python eval/openevolve/openevolve-run.py \
    $WORKLOAD_DIR/sol.init.cu \
    $SUITE_ROOT/eval.py \
    -c eval/config/openevolve.yml \
    -o out/npb-$WORKLOAD \
    -p $WORKLOAD_DIR \
    -a config/cudanalyst_template.yml

Compared to the original OpenEvolve, two new options are introduced:

-p: Specifies the subtask evaluation directory.
-a: Specifies the CUDAnalyst config file path. Default: None

These two options are passed to the task-specific evaluate() as arguments.

After execution, the output directory will look like:

out/npb-CG/
├── best/
├── checkpoints/
└── logs/

The checkpoints/ directory will be used in the generation-level intervention stage.

LLM4AD evolution

Install the modified LLM4AD fork:

git submodule update --init --recursive
cd ./eval/llm4ad
uv pip install -r requirements.txt
uv pip install -e .

For usage, please refer to the LLM4AD docs. We provide an example on PolyBench-ACC 3MM with the EoH algorithm in ./eval/demo/llm4ad_eoh.py.

After execution, the output directory will look like:

out/eoh-3MM/
├── population/
├── run_log.txt
└── samples/
    ├── samples_1~200.json
    └── samples_best.json

The samples_1~200.json file, which keeps all generated kernels, will be used in the generation-level intervention stage.

Generation-level Intervention

This workflow enables:

Grouping evolution results by generation
Performing multi-rollout intervention on each generation
Computing pass@$k$-like statistics

See demo.py for a minimal example.

Group evolution outputs by generation:

For OpenEvolve output:

from pathlib import Path

from cudanalyst.helper.ckpt import group_oe_by_gen

# ? path to the OpenEvolve output checkpoints
SRC_CKPT_DIR = Path("./out/npb-CG/checkpoints")
# ? path to keep the generation-level samples
DST_GEN_DIR = Path("./tmp/npb-CG")

group_oe_by_gen(SRC_CKPT_DIR, DST_GEN_DIR)

For LLM4AD output:

from pathlib import Path

from cudanalyst.helper.ckpt import group_llm4ad_by_gen

SRC_SAMPLE_RECORD = Path("./out/eoh-3MM/samples/samples_1~200.json")
DST_GEN_DIR = Path("./tmp/eoh-3MM")

group_llm4ad_by_gen(SRC_SAMPLE_RECORD, DST_GEN_DIR)

The resulting structure looks like:

tmp/npb-CG/
├── gen0/
├── gen1/
    ...
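
When iterating over the grouped output, note that a plain lexicographic sort would place gen10 before gen2. A minimal sketch that orders the directories by generation index, assuming the genN naming shown above; the helper `list_generation_dirs` is hypothetical and not part of CUDAnalyst:

```python
import re
from pathlib import Path

# Matches directory names such as "gen0", "gen1", "gen12".
_GEN_RE = re.compile(r"^gen(\d+)$")


def list_generation_dirs(root: Path) -> list[Path]:
    """Return the genN subdirectories of `root`, ordered by N."""
    found = []
    for child in root.iterdir():
        match = _GEN_RE.match(child.name)
        if child.is_dir() and match:
            found.append((int(match.group(1)), child))
    # Sort on the integer index, not on the directory name.
    return [path for _, path in sorted(found, key=lambda item: item[0])]
```

For example, `list_generation_dirs(Path("./tmp/npb-CG"))` would yield the generation directories in evolution order.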

Use the template ./config/keyset_template.yml to configure your LLM API key, endpoint, and model. Make sure the chat_config_path variable points to this file.

Define the AnalysisMask to control the feedback granularity of each module.

from cudanalyst.module.config import ModuleBits
from cudanalyst.pipeline.config import AnalysisMask

mask = AnalysisMask(
    debug=ModuleBits.MODE_FULL,
    anlz=ModuleBits.MODE_FULL,
    perf=ModuleBits.MODE_FULL,
    plan=ModuleBits.MODE_FULL,
)

MODE_FULL: Structured summarized feedback (recommended)
MODE_RAW: Raw, unprocessed feedback
MODE_NONE: Disable the module

This allows controlled ablation experiments.
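
For an ablation study it can help to enumerate the mask settings up front. The sketch below uses plain strings as stand-ins for the `ModuleBits` members so that it runs without CUDAnalyst installed. The helper `ablation_grid` is hypothetical, and keeping `plan` fixed at `MODE_FULL` is an assumption made for illustration:

```python
from itertools import product

# String stand-ins for the three ModuleBits modes described above.
MODES = ("MODE_FULL", "MODE_RAW", "MODE_NONE")
# The three feedback modules whose granularity is varied.
FEEDBACK_MODULES = ("debug", "anlz", "perf")


def ablation_grid(plan_mode: str = "MODE_FULL") -> list[dict[str, str]]:
    """Enumerate every mode combination for the feedback modules."""
    grid = []
    for combo in product(MODES, repeat=len(FEEDBACK_MODULES)):
        setting = dict(zip(FEEDBACK_MODULES, combo))
        setting["plan"] = plan_mode  # planning mode is held constant
        grid.append(setting)
    return grid


if __name__ == "__main__":
    for setting in ablation_grid()[:3]:
        print(setting)
```

Each dictionary could then be mapped onto a mask, for instance with `AnalysisMask(**{name: getattr(ModuleBits, mode) for name, mode in setting.items()})`.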

Launch async multi-rollout intervention.

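import asyncio
from pathlib import Path

from cudanalyst.workflow import intervene_async

# `evaluate` is the task-specific evaluator and CONFIG_FILE is the keyset file
# configured above; both must be defined before this call.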

problem_dir = Path("benchmark/hpc/npb/src/CG")
output_root = Path("gen/npb-CG")
res_list = asyncio.run(
    intervene_async(
        evaluate_func=evaluate,
        input_ckpt_dir=DST_GEN_DIR,
        output_root_dir=output_root,
        chat_config_path=CONFIG_FILE,
        config_mask=mask,
        num_run=3,    # * rollout
        num_trials=5,  # * k
        max_workers=32,
        llm_concurrency=16,
        problem_dir=problem_dir,
    )
)

num_run: Number of rollouts per evaluation
num_trials: Used for pass@k computation
max_workers: Local parallel workers
llm_concurrency: Concurrent LLM API calls
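
To illustrate what a cap such as llm_concurrency typically means in asyncio code, the sketch below bounds the number of in-flight coroutines with a semaphore. It is not CUDAnalyst's implementation; `run_bounded`, `fake_llm_call`, and `demo_peak_concurrency` are hypothetical names, and the LLM call is simulated with a short sleep:

```python
import asyncio


async def run_bounded(jobs, limit: int):
    """Run zero-argument coroutine functions with at most `limit` in flight."""
    sem = asyncio.Semaphore(limit)

    async def guarded(job):
        async with sem:  # waits while `limit` jobs are already running
            return await job()

    # gather preserves the order of `jobs` in its result list.
    return await asyncio.gather(*(guarded(job) for job in jobs))


async def demo_peak_concurrency(num_calls: int = 8, limit: int = 3):
    """Return the results and the highest number of simultaneous calls seen."""
    in_flight = 0
    peak = 0

    async def fake_llm_call():
        nonlocal in_flight, peak
        in_flight += 1
        peak = max(peak, in_flight)
        await asyncio.sleep(0.01)  # stands in for network latency
        in_flight -= 1
        return "ok"

    results = await run_bounded([fake_llm_call] * num_calls, limit)
    return results, peak


if __name__ == "__main__":
    print(asyncio.run(demo_peak_concurrency()))
```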

Output structure:

gen/npb-CG
├── p7-d7-a7-p7
│   ├── run-0
│   ├── run-1
│   └── run-2
└── p7-d7-a7-p7.json

You may use intervene_async_multi() to launch concurrent evaluations across different feedback granularities when num_run is small.

Compute statistics.

from cudanalyst.helper.stat import compute_label_stats

res_stat = compute_label_stats(res_list, ["fast", "pass"], as_df=True)

It returns a dictionary by default, or a pd.DataFrame when as_df=True. Supported labels are as defined in class Status(str, Enum):

# src/cudanalyst/helper/exec.py
from enum import Enum

class Status(str, Enum):
    FAIL = "fail"
    COMPILE = "compile"
    PASS = "pass"

    # only set by check_fast()
    FAST = "fast"

    @property
    def rank(self):
        _ranks = {Status.FAIL: 0, Status.COMPILE: 1, Status.PASS: 2, Status.FAST: 3}
        return _ranks.get(self, -1)

    def __lt__(self, other):
        if self.__class__ is other.__class__:
            return self.rank < other.rank
        return NotImplemented

The resulting statistics can be used to calculate pass@$k$ and to plot generation-level curves.
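
As a reference point, pass@$k$ is commonly computed with the unbiased estimator $1 - \binom{n-c}{k} / \binom{n}{k}$, where $n$ is the number of trials and $c$ the number that succeeded. The sketch below applies it to a list of statuses; whether `compute_label_stats` uses exactly this formula is not stated here, so treat it as an assumption. The helpers `count_at_least` and `pass_at_k` are hypothetical, and counting `FAST` as also passing is an assumption. Ranks are compared through an explicit table because the enum above overrides only `__lt__`, so operators such as `>=` fall back to plain string comparison:

```python
from enum import Enum
from math import comb


class Status(str, Enum):
    # Values mirror the Status enum listed above.
    FAIL = "fail"
    COMPILE = "compile"
    PASS = "pass"
    FAST = "fast"


# Same ordering as the `rank` property in the original class.
RANK = {Status.FAIL: 0, Status.COMPILE: 1, Status.PASS: 2, Status.FAST: 3}


def count_at_least(statuses, threshold: Status) -> int:
    """Count statuses whose rank is at least that of `threshold`."""
    return sum(RANK[s] >= RANK[threshold] for s in statuses)


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n trials with c successes."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k trials failed, giving 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    trials = [Status.FAIL, Status.COMPILE, Status.PASS, Status.FAST, Status.FAIL]
    c = count_at_least(trials, Status.PASS)
    print(pass_at_k(len(trials), c, k=1))
```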

We provide ReplayPipeAsync (ReplayPipe) to replay previously generated intervention plans for plan distillation experiments. To enable replay in intervene_async()/intervene_async_multi() (or intervene_sync()):
- Set input_ckpt_dir to the intervention result directory
- Set replay=True

Known Issues

We focus on how feedback-to-plan decisions are made under a fixed context (static assessment). Support for short-horizon, memory-aware evaluation is planned for future work.

DIY Evaluator

We adopt an OpenEvolve-style evaluator for convenience (see docs).

1. Create your own evaluator

Each benchmark suite under the ./benchmark directory provides a reference eval.py. An evaluator typically defines:

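import os
import shutil
import tempfile
from dataclasses import dataclass
from pathlib import Path

from cudanalyst import AnalysisCfg, ToolContext
from cudanalyst.result import *

@dataclass
class YourMeta(ResultMeta):
    # add custom metadata fields here
    ...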

@dataclass
class YourResult(Result):
    base_result: YourMeta = None
    custom_result: YourMeta = None
    # add additional statistics if needed

def _exec(...) -> YourMeta:
    # implement execution logic here
    ...

@return_asdict
def evaluate(program_path: os.PathLike, problem_dir: os.PathLike, config: AnalysisCfg = None):
    program_path = Path(program_path)
    with tempfile.TemporaryDirectory() as tmpdir:
        tmp_path = Path(tmpdir)

        dst_path = tmp_path / "src"
        shutil.copytree(SRC_DIR, dst_path)

        code_path = dst_path / "Solution.cu"
        code_path.write_text(extract_codeblock(program_path.read_text()))

        ctx = ToolContext(code_path=code_path, cwd=dst_path)

        gpu_id = pick_idle_gpu()

        base_result = _exec(...)
        custom_result = _exec(...)
        
        score = ... # compute combined score
        return YourResult(
            combined_score=score,
            base_result=base_result,
            custom_result=custom_result
        )

2. Integrate CUDAnalyst

You can plug CUDAnalyst into your evaluator via planning():

from cudanalyst import AnalysisCfg, ToolContext, planning

def _exec(...):
    config = AnalysisCfg()
    ctx = ToolContext(
        code_path=(task_dir / task_name / "sol.cu"), 
        cwd=task_dir,
    )
    ctx.cmd = ["make", task_name, "CLASS=S"]  # same format as subprocess.run
    return YourResult(reports=planning(config, ctx))

Configuring `AnalysisCfg`

Two options:

Load from YAML (template: ./config/cudanalyst_template.yml)

chat_config_path: "./config/keyset.yml"

debug_cfg:
    enabled: true
    formatted: true
    summarized: true

anlz_cfg:
    enabled: true
    formatted: true
    summarized: true

perf_cfg:
    enabled: true
    formatted: true
    summarized: true

plan_cfg:
    enabled: true
    summarized: true

from cudanalyst import load_analysis_cfg, AnalysisCfg

config: AnalysisCfg = load_analysis_cfg("./config/cudanalyst_template.yml")

Apply a bitmask

from cudanalyst import load_analysis_cfg, AnalysisCfg, AnalysisMask
from cudanalyst.module.config import ModuleBits

config_mask = mask = AnalysisMask(
    debug=ModuleBits.MODE_RAW,   # raw feedback, no summary
    anlz=ModuleBits.MODE_FULL,   # summarized feedback by an agent
    perf=ModuleBits.MODE_NONE,   # disabled
    plan=ModuleBits.MODE_FULL,   # explicit planning
)

config = apply_config_mask(AnalysisCfg(chat_config_path), config_mask)

Setting `ToolContext`

code_path: path to the source code to analyze
cmd: command to be executed (list of strings, same format as subprocess.run)
cwd: working directory for the analysis
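
To make the list-of-strings convention for cmd concrete, the sketch below runs such a command in a working directory with `subprocess.run`. `ToolContextSketch` is a hypothetical stand-in that only mirrors the three fields described above; it is not the real `ToolContext`, and `run_context` is not a CUDAnalyst function:

```python
import subprocess
import sys
import tempfile
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class ToolContextSketch:
    """Hypothetical stand-in holding code_path, cwd, and cmd."""

    code_path: Path
    cwd: Path
    cmd: list[str] = field(default_factory=list)


def run_context(ctx: ToolContextSketch) -> subprocess.CompletedProcess:
    """Run ctx.cmd inside ctx.cwd and capture its text output."""
    return subprocess.run(
        ctx.cmd, cwd=ctx.cwd, capture_output=True, text=True, check=False
    )


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        ctx = ToolContextSketch(code_path=Path(tmp) / "sol.cu", cwd=Path(tmp))
        # A list of strings, exactly as one would pass to subprocess.run.
        ctx.cmd = [sys.executable, "-c", "print('ok')"]
        print(run_context(ctx).stdout.strip())
```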

Contributing

We welcome contributions, especially for porting existing CUDA benchmarks or adding new agentic frameworks and analysis pipelines.

Development Setup

This repository uses uv to manage dependencies and the development environment, and pre-commit to enforce code style and formatting rules.

To set up your environment and enable pre-commit hooks:

uv sync --dev      # install/update development dependencies
pre-commit install # enable pre-commit hooks for code style

After this, any committed code will automatically be checked and formatted according to the project's standards.

Citation

If you find this project useful, please consider citing:

% to be updated
@inproceedings{anonymous2026towards,
    title        = {
        Towards Feedback-to-Plan Decisions for Self-Evolving {LLM} Agents in
        {CUDA} Kernel Generation
    },
    author       = {Anonymous},
    year         = 2026,
    booktitle    = {Forty-third International Conference on Machine Learning},
    url          = {https://openreview.net/forum?id=s70zO5Lvvj}
}
