[ICML2026] Towards Feedback-to-Plan Decisions for Self-Evolving LLM Agents in CUDA Kernel Generation
CUDAnalyst (CUDA + Analyst) is an analysis framework for studying how self-evolving LLM agents make planning decisions during CUDA kernel generation. Based on CUDAnalyst, we perform generation-level feedback interventions in self-evolving LLM agents for CUDA kernels, attributing contributions of feedback signals to plan decisions.
Rather than treating agent execution as a monolithic end-to-end process, where trajectory drift makes controlled attribution difficult, CUDAnalyst decouples feedback acquisition from planning generation, enabling controlled intervention and generation-level attribution across iterative optimization trajectories.
The framework supports principled evaluation of heterogeneous feedback signals, including correctness (Debugger), static analysis (Analyzer), and runtime performance (Profiler), and measures how each signal substrate influences subsequent planning behavior (PlanAgent).
To support scalable experimentation, CUDAnalyst adopts a sample-centric, event-driven execution architecture with concurrent pipelined evaluation, namely IntervenePipe.
git clone https://github.com/yuxuan-z19/cudanalyst.git
cd cudanalyst
# install PyTorch with appropriate CUDA backend
uv pip install torch --torch-backend=cu124
# for simple usage
uv pip install -e .
# for development
uv sync --dev
Ensure you have installed Nsight Compute >= 2025.2.1 and added the absolute path of its extras/python directory to the $PYTHONPATH environment variable.
❯ export PYTHONPATH="/data/zyx/local/nsight-compute-2025.2.1/extras/python:$PYTHONPATH"
❯ python -c "import ncu_report;"
❯ echo $?
0
This project supports three main workflows:
- OpenEvolve-based evolution
- LLM4AD-based evolution
- Generation-level Intervention (multi-rollout intervention on evolution trajectories)
The recommended order is: Run evolution
Install the modified OpenEvolve fork:
git submodule update --init --recursive
cd ./eval/openevolve && uv pip install -e .
An example using the NPB-GPU CG workload is provided in ./eval/demo/oe-npb.bash:
SUITE_ROOT="benchmark/hpc/npb"
WORKLOAD="CG"
WORKLOAD_DIR=$SUITE_ROOT/src/$WORKLOAD
python eval/openevolve/openevolve-run.py \
    $WORKLOAD_DIR/sol.init.cu \
    $SUITE_ROOT/eval.py \
    -c eval/config/openevolve.yml \
    -o out/npb-$WORKLOAD \
    -p $WORKLOAD_DIR \
    -a config/cudanalyst_template.yml \
Compared to the original OpenEvolve, two new options are introduced:
- -p: Specifies the subtask evaluation directory.
- -a: Specifies the CUDAnalyst config file path. Default: None
These two options are passed to the task-specific evaluate() as arguments.
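As an illustration of how such options might reach evaluate(), the sketch below parses -p and -a and binds them as keyword arguments. It is not the fork's actual code: build_evaluator, the dest names, and the stand-in evaluate() are hypothetical; only the -p/-a flags and the problem_dir/config parameter names come from this README.

```python
import argparse
from functools import partial
from pathlib import Path


def evaluate(program_path, problem_dir=None, config=None):
    # Stand-in for a task-specific evaluate(); it only echoes its inputs.
    # A real evaluator would load the config path into an AnalysisCfg.
    return {"program": str(program_path), "problem_dir": problem_dir, "config": config}


def build_evaluator(argv):
    # Hypothetical helper: parse the two extra options and pre-bind them.
    parser = argparse.ArgumentParser()
    parser.add_argument("-p", dest="problem_dir", type=Path, default=None)
    parser.add_argument("-a", dest="analysis_cfg", type=Path, default=None)
    args = parser.parse_args(argv)
    # Bind the options once so callers only pass the program path.
    return partial(evaluate, problem_dir=args.problem_dir, config=args.analysis_cfg)


run = build_evaluator(
    ["-p", "benchmark/hpc/npb/src/CG", "-a", "config/cudanalyst_template.yml"]
)
print(run("sol.init.cu"))
```

Omitting -a leaves config as None, matching the documented default.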
After execution, the output directory will look like:
out/npb-CG/
├── best/
├── checkpoints/
└── logs/
The checkpoints/ directory will be used in the generation-level intervention stage.
Install the modified LLM4AD fork:
git submodule update --init --recursive
cd ./eval/llm4ad
uv pip install -r requirements.txt
uv pip install -e .
For usage, please refer to the LLM4AD docs. We provide an example on PolyBench-ACC 3MM with the EoH algorithm in ./eval/demo/llm4ad_eoh.py.
After execution, the output directory will look like:
out/eoh-3MM/
├── population/
├── run_log.txt
└── samples/
    ├── samples_1~200.json
    └── samples_best.json
The samples_1~200.json file, which keeps all generated kernels, will be used in the generation-level intervention stage.
This workflow enables:
- Grouping evolution results by generation
- Performing multi-rollout intervention on each generation
- Computing pass@$k$-like statistics
See demo.py for a minimal example.
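The pass@$k$-like statistics mentioned above can be illustrated with the standard unbiased pass@$k$ estimator, $1 - \binom{n-c}{k} / \binom{n}{k}$, for $n$ samples of which $c$ succeed. This is a minimal sketch under that assumption: the function name pass_at_k is hypothetical, and CUDAnalyst's own statistics may be defined differently.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which succeeded."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples failed, so the result is 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical numbers: 5 trials, 2 of them labeled "pass".
for k in (1, 3, 5):
    print(k, round(pass_at_k(n=5, c=2, k=k), 3))
```

The estimate grows with $k$ and reaches 1.0 once $k$ exceeds the number of failed samples.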
- Group evolution outputs by generation:
  - For OpenEvolve output:

        from pathlib import Path

        from cudanalyst.helper.ckpt import group_oe_by_gen

        # ? path to the OpenEvolve output checkpoints
        SRC_CKPT_DIR = Path("./out/npb-CG/checkpoints")
        # ? path to keep the generation-level samples
        DST_GEN_DIR = Path("./tmp/npb-CG")
        group_oe_by_gen(SRC_CKPT_DIR, DST_GEN_DIR)

  - For LLM4AD output:

        from pathlib import Path

        from cudanalyst.helper.ckpt import group_llm4ad_by_gen

        SRC_SAMPLE_RECORD = Path("./out/eoh-3MM/samples/samples_1~200.json")
        DST_GEN_DIR = Path("./tmp/eoh-3MM")
        group_llm4ad_by_gen(SRC_SAMPLE_RECORD, DST_GEN_DIR)

  The resulting structure looks like:

      tmp/npb-CG/
      ├── gen0/
      ├── gen1/
      ...
- Use the template ./config/keyset_template.yml to configure your LLM API key, endpoint, and model. Make sure the chat_config_path variable points to this file.
- Define the AnalysisMask to control the feedback granularity of each module.

      from cudanalyst.module.config import ModuleBits
      from cudanalyst.pipeline.config import AnalysisMask

      mask = AnalysisMask(
          debug=ModuleBits.MODE_FULL,
          anlz=ModuleBits.MODE_FULL,
          perf=ModuleBits.MODE_FULL,
          plan=ModuleBits.MODE_FULL,
      )

  - MODE_FULL: Structured summarized feedback (recommended)
  - MODE_RAW: Raw, unprocessed feedback
  - MODE_NONE: Disable the module

  This allows controlled ablation experiments.
- Launch async multi-rollout intervention.

      import asyncio
      from pathlib import Path

      from cudanalyst.workflow import intervene_async

      problem_dir = Path("benchmark/hpc/npb/src/CG")
      output_root = Path("gen/npb-CG")
      res_list = asyncio.run(
          intervene_async(
              evaluate_func=evaluate,
              input_ckpt_dir=DST_GEN_DIR,
              output_root_dir=output_root,
              chat_config_path=CONFIG_FILE,
              config_mask=mask,
              num_run=3,  # * rollout
              num_trials=5,  # * k
              max_workers=32,
              llm_concurrency=16,
              problem_dir=problem_dir,
          )
      )

  - num_run: Number of rollouts per evaluation
  - num_trials: Used for pass@k computation
  - max_workers: Local parallel workers
  - llm_concurrency: Concurrent LLM API calls

  Output structure:

      gen/npb-CG
      ├── p7-d7-a7-p7
      │   ├── run-0
      │   ├── run-1
      │   └── run-2
      └── p7-d7-a7-p7.json

  You may use intervene_async_multi() to launch concurrent evaluation on different feedback granularities when num_run is small.
- Compute statistics.

      from cudanalyst.helper.stat import compute_label_stats

      res_stat = compute_label_stats(res_list, ["fast", "pass"], as_df=True)

  It returns a dictionary, or a pd.DataFrame when as_df=True. Supported labels are those defined in class Status(str, Enum):

      # src/cudanalyst/helper/exec.py
      class Status(str, Enum):
          FAIL = "fail"
          COMPILE = "compile"
          PASS = "pass"
          # only set by check_fast()
          FAST = "fast"

          @property
          def rank(self):
              _ranks = {Status.FAIL: 0, Status.COMPILE: 1, Status.PASS: 2, Status.FAST: 3}
              return _ranks.get(self, -1)

          def __lt__(self, other):
              if self.__class__ is other.__class__:
                  return self.rank < other.rank
              return NotImplemented

  The result can be used for calculating pass@$k$ and plotting generation-level curves.
- We provide ReplayPipeAsync (ReplayPipe) to replay previously generated intervention plans for plan distillation experiments. In intervene_async() / intervene_async_multi() (intervene_sync()):
  - Set input_ckpt_dir to the intervention result directory
  - Set replay=True
- We focus on how feedback-to-plan decisions are made under a fixed context (static assessment). Support for short-horizon, memory-aware evaluation is planned for future work.
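The llm_concurrency option of intervene_async() described above caps how many LLM API calls are in flight at once. The sketch below shows one common way to get that behavior with asyncio.Semaphore. It illustrates the general pattern only, not CUDAnalyst's IntervenePipe implementation; run_rollouts and fake_llm_call are hypothetical names.

```python
import asyncio


async def run_rollouts(jobs, llm_concurrency):
    """Run all jobs concurrently with at most llm_concurrency in flight."""
    gate = asyncio.Semaphore(llm_concurrency)
    in_flight = 0
    peak = 0

    async def guarded(job):
        nonlocal in_flight, peak
        async with gate:  # waits here once the cap is reached
            in_flight += 1
            peak = max(peak, in_flight)
            try:
                return await job()
            finally:
                in_flight -= 1

    results = await asyncio.gather(*(guarded(job) for job in jobs))
    return results, peak


async def fake_llm_call():
    # Placeholder for a network-bound LLM request.
    await asyncio.sleep(0.01)
    return "plan"


results, peak = asyncio.run(run_rollouts([fake_llm_call] * 8, llm_concurrency=3))
print(len(results), peak)
```

All eight jobs complete, while the recorded peak never exceeds the cap.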
We adopt an OpenEvolve-style evaluator for convenience (see docs).
Each benchmark suite under the ./benchmark directory provides a reference eval.py. An evaluator typically defines:
import os
import shutil
import tempfile
from dataclasses import dataclass
from pathlib import Path

from cudanalyst.result import *


@dataclass
class YourMeta(ResultMeta):
    # add custom metadata fields here
    ...


@dataclass
class YourResult(Result):
    base_result: YourMeta = None
    custom_result: YourMeta = None
    # add additional statistics if needed


def _exec(...) -> YourMeta:
    # implement execution logic here
    ...


@return_asdict
def evaluate(program_path: os.PathLike, problem_dir: os.PathLike, config: AnalysisCfg = None):
    program_path = Path(program_path)
    with tempfile.TemporaryDirectory() as tmpdir:
        tmp_path = Path(tmpdir)
        dst_path = tmp_path / "src"
        shutil.copytree(SRC_DIR, dst_path)
        code_path = dst_path / "Solution.cu"
        code_path.write_text(extract_codeblock(program_path.read_text()))
        ctx = ToolContext(code_path=code_path, cwd=dst_path)
        gpu_id = pick_idle_gpu()
        base_result = _exec(...)
        custom_result = _exec(...)
        score = ...  # compute combined score
        return YourResult(
            combined_score=score,
            base_result=base_result,
            custom_result=custom_result,
        )
You can plug CUDAnalyst into your evaluator via planning():
from cudanalyst import AnalysisCfg, ToolContext, planning


def _exec(...):
    config = AnalysisCfg()
    ctx = ToolContext(
        code_path=(task_dir / task_name / "sol.cu"),
        cwd=task_dir,
    )
    ctx.cmd = ["make", task_name, "CLASS=S"]  # same format as subprocess.run
    return YourResult(reports=planning(config, ctx))
Two options are available for building the AnalysisCfg:
- Load from YAML (template: ./config/cudanalyst_template.yml)

      chat_config_path: "./config/keyset.yml"
      debug_cfg:
        enabled: true
        formatted: true
        summarized: true
      anlz_cfg:
        enabled: true
        formatted: true
        summarized: true
      perf_cfg:
        enabled: true
        formatted: true
        summarized: true
      plan_cfg:
        enabled: true
        summarized: true

  Then load it:

      from cudanalyst import load_analysis_cfg, AnalysisCfg

      config: AnalysisCfg = load_analysis_cfg("./config/cudanalyst_template.yml")

- Apply a bitmask

      from cudanalyst import load_analysis_cfg, AnalysisCfg, AnalysisMask
      from cudanalyst.module.config import ModuleBits

      config_mask = AnalysisMask(
          debug=ModuleBits.MODE_RAW,  # raw feedback, no summary
          anlz=ModuleBits.MODE_FULL,  # summarized feedback by an agent
          perf=ModuleBits.MODE_NONE,  # disabled
          plan=ModuleBits.MODE_FULL,  # explicit planning
      )
      config = apply_config_mask(AnalysisCfg(chat_config_path), config_mask)
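To make the relation between the two options concrete, the sketch below models one mode as bit flags that decode into the enabled/formatted/summarized switches seen in the YAML template. The bit values and the names ModeBitsSketch and decode are assumptions for illustration only, chosen so that the full mode equals 7 in line with the p7-d7-a7-p7 output directory name; the real ModuleBits may be laid out differently.

```python
from enum import IntFlag


class ModeBitsSketch(IntFlag):
    # Hypothetical bit layout; not taken from cudanalyst.module.config.
    ENABLED = 1
    FORMATTED = 2
    SUMMARIZED = 4
    MODE_NONE = 0
    MODE_RAW = ENABLED
    MODE_FULL = ENABLED | FORMATTED | SUMMARIZED


def decode(mode: ModeBitsSketch) -> dict:
    """Expand a mode into per-module switches like those in the YAML config."""
    return {
        "enabled": bool(mode & ModeBitsSketch.ENABLED),
        "formatted": bool(mode & ModeBitsSketch.FORMATTED),
        "summarized": bool(mode & ModeBitsSketch.SUMMARIZED),
    }


for name in ("MODE_FULL", "MODE_RAW", "MODE_NONE"):
    print(name, decode(ModeBitsSketch[name]))
```

Under this assumed layout, MODE_RAW keeps a module enabled but skips formatting and summarization, and MODE_NONE switches it off entirely.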
ToolContext fields:
- code_path: path to the source code to analyze
- cmd: command to be executed (list of strings, same format as subprocess.run)
- cwd: working directory for the analysis
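As a rough illustration of how these three fields relate, the sketch below uses a stand-in dataclass and hands cmd and cwd to subprocess.run. ToolContextSketch and run_in_context are hypothetical and are not CUDAnalyst's implementation; the real ToolContext may carry more state.

```python
import subprocess
import sys
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class ToolContextSketch:
    # Hypothetical stand-in mirroring the three documented fields.
    code_path: Path
    cwd: Path
    cmd: list = field(default_factory=list)


def run_in_context(ctx: ToolContextSketch) -> subprocess.CompletedProcess:
    # cmd is a list of strings, exactly as subprocess.run expects.
    return subprocess.run(ctx.cmd, cwd=ctx.cwd, capture_output=True, text=True, check=False)


ctx = ToolContextSketch(code_path=Path("sol.cu"), cwd=Path("."))
# A real evaluator would set something like ["make", task_name, "CLASS=S"].
ctx.cmd = [sys.executable, "-c", "print('ok')"]
proc = run_in_context(ctx)
print(proc.returncode, proc.stdout.strip())
```

Passing the command as a list avoids shell quoting issues, and cwd keeps build artifacts inside the per-task working directory.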
We welcome contributions, especially for porting existing CUDA benchmarks or adding new agentic frameworks and analysis pipelines.
This repository uses uv to manage dependencies and the development environment, and pre-commit to enforce code style and formatting rules.
To set up your environment and enable pre-commit hooks:
uv sync --dev # install/update development dependencies
pre-commit install # enable pre-commit hooks for code style
After this, any committed code will automatically be checked and formatted according to the project's standards.
If you find this project useful, please consider citing:
% to be updated
@inproceedings{anonymous2026towards,
title = {
Towards Feedback-to-Plan Decisions for Self-Evolving {LLM} Agents in
{CUDA} Kernel Generation
},
author = {Anonymous},
year = 2026,
booktitle = {Forty-third International Conference on Machine Learning},
url = {https://openreview.net/forum?id=s70zO5Lvvj}
}