ZJU-REAL/Code-A1

Code-A1: Adversarial Evolving of Code LLM and Test LLM via Reinforcement Learning

Aozhe Wang¹*, Yuchen Yan¹*, Nan Zhou¹*, Zhengxi Lu¹*
Weiming Lu¹, Jun Xiao¹, Yueting Zhuang¹, Yongliang Shen¹†

¹Zhejiang University
*Equal contributions, †Corresponding authors

An adversarial co-evolution framework that jointly optimizes a Code LLM and a Test LLM via reinforcement learning.

Paper | Project Page


Code-A1 Framework

Code-A1 jointly trains a Code LLM and a Test LLM with opposing objectives, enables white-box adversarial test generation without self-collusion, and uses a Mistake Book replay mechanism to retrieve historical failure cases.

Motivation

Reinforcement learning for code generation typically depends on unit-test pass rates as verifiable rewards. In practice, this creates three persistent issues:

  • Static golden tests are limited in coverage and quickly saturate as the code model improves.
  • Black-box generated tests are often too generic to expose implementation-specific bugs.
  • Single-model self-play introduces self-collusion: the same model can generate easy tests that inflate rewards without improving code quality.

Code-A1 addresses this by separating the two roles. A Code LLM is optimized to solve programming problems, while a Test LLM is optimized to expose errors in the generated code. This makes white-box test generation useful rather than dangerous: the Test LLM can inspect candidate implementations and generate targeted adversarial tests. The framework further stabilizes co-evolution with a Mistake Book replay mechanism and a composite reward that balances test validity and adversarial difficulty.

Code-A1 Introduction

✨ Highlights

  • Adversarial co-evolution: jointly optimizes a Code LLM and a Test LLM with opposite objectives instead of collapsing both roles into one model.
  • White-box test generation without self-collusion: the Test LLM conditions on candidate code to synthesize implementation-specific attack tests.
  • Mistake Book replay: historical failure cases are replayed during training so the Code LLM does not forget earlier weaknesses.
  • Composite reward for Test LLM: balances executable test validity with adversarial difficulty.
  • Strong empirical gains: improves both code generation and test generation, and generated tests can even serve as competitive static supervision.

πŸ›  Installation

The project uses Python 3.10 and uv for environment management. The provided scripts create two separate environments: one for RL training and one for evaluation.

1. RL environment

cd Code-A1
bash rl_env.sh
source .venv-rl/bin/activate

This script creates .venv-rl, installs the local verl fork in editable mode, and installs the main runtime dependencies including ray, vllm/sglang support, and sandbox_fusion.

2. Evaluation environment

cd Code-A1
bash eval_env.sh
source .venv-eval/bin/activate

This script creates .venv-eval and installs the packages needed for BigCodeBench evaluation, mutation-based tooling, vllm==0.11.0, wandb, and sandbox_fusion.

3. Runtime prerequisites

Before training or evaluation, configure environment variables in Code-A1/code/rl/run/set_env.sh:

export SANDBOX_FUSION_ENDPOINT=YOUR_SANDBOX_IP
export WANDB_API_KEY=YOUR_WANDB_API_KEY

The training scripts also expect:

  • A reachable sandbox_fusion service for secure code execution.
  • GPUs compatible with the selected model scale and FSDP-based training.
  • Access to the base models specified in the YAML configs, such as Qwen/Qwen2.5-Coder-1.5B-Instruct.
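Before launching, it can help to verify the variables from set_env.sh are actually exported in the current shell. The sketch below is a minimal, hypothetical preflight check (the helper `check_required_env` is not part of the repo; only the two variable names come from set_env.sh):

```python
import os

def check_required_env(names):
    """Return the subset of `names` that is missing or empty in the environment."""
    return [n for n in names if not os.environ.get(n)]

# The two variables configured in code/rl/run/set_env.sh.
required = ["SANDBOX_FUSION_ENDPOINT", "WANDB_API_KEY"]
missing = check_required_env(required)
print("missing:", missing or "none")
```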

πŸ“Š Dataset

The training data is stored in Code-A1/code/rl/train_data/kodcode_hard_dual_model_training_data_mix.parquet and is built from 9,688 hard-difficulty questions from KodCode-V1.

πŸš€ Quick Start

Sandbox check

Validate that the sandbox executor is reachable before running training:

cd Code-A1
bash test_sandbox.sh

Expected output should contain a successful RunCodeResponse with return_code=0.

RL training

Provided launch scripts:

cd Code-A1
source .venv-rl/bin/activate

bash code/rl/run/1.5B_A1.sh
bash code/rl/run/3B_A1.sh
bash code/rl/run/7B_A1.sh

For example, the default 1.5B config uses:

  • Qwen/Qwen2.5-Coder-1.5B-Instruct as the Code LLM
  • Qwen/Qwen2.5-Coder-1.5B-Instruct as the Test LLM
  • alpha: 0.5 in the composite test reward
  • n_gpus_per_node: 4 for both Code LLM and Test LLM
  • rollout.n: 8 during training and n: 32 during validation sampling
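The `alpha: 0.5` setting can be read as a convex combination of the two components of the composite test reward. The sketch below is only illustrative: the exact definitions of test validity and adversarial difficulty, and which term `alpha` weights, are specified in the paper, so `composite_test_reward` and its arguments are assumptions:

```python
def composite_test_reward(validity, difficulty, alpha=0.5):
    """Convex combination of test validity and adversarial difficulty.

    Both inputs are assumed normalized to [0, 1]; the actual reward terms
    used by Code-A1 are defined in the paper.
    """
    assert 0.0 <= alpha <= 1.0
    return alpha * validity + (1.0 - alpha) * difficulty

# With alpha = 0.5, a fully valid but easy test and an invalid but maximally
# adversarial test receive the same reward.
print(composite_test_reward(1.0, 0.0))            # 0.5
print(round(composite_test_reward(0.8, 0.6), 2))  # 0.7
```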

Evaluation

For the Code LLM:

  • HumanEval+ and MBPP+ are included in the validation data and are evaluated during training.
  • BigCodeBench is evaluated separately:

cd Code-A1
source .venv-eval/bin/activate
bash code/rl/run/eval.sh

For the Test LLM, evaluation is conducted with UnLeakedTestBench.

πŸ“ˆ Main Results

Code generation

Code-A1 outperforms both the Golden Tests baseline trained on human annotations and the Self-Play approach on HumanEval+, MBPP+, and BigCodeBench.

| Code LLM | Method | HumanEval+ | MBPP+ | BigCodeBench | Avg |
|---|---|---|---|---|---|
| Qwen2.5-Coder-1.5B-Instruct | Base | 63.42 | 60.87 | 29.34 | 51.21 |
| Qwen2.5-Coder-1.5B-Instruct | Golden Tests | 71.15 | 63.30 | 34.23 | 56.23 |
| Qwen2.5-Coder-1.5B-Instruct | Self-Play | 70.64 | 63.54 | 33.47 | 55.88 |
| Qwen2.5-Coder-1.5B-Instruct | Code-A1 | 72.69 | 63.33 | 34.82 | 56.95 |
| Qwen2.5-Coder-3B-Instruct | Base | 77.63 | 63.12 | 41.78 | 60.84 |
| Qwen2.5-Coder-3B-Instruct | Golden Tests | 81.96 | 68.05 | 45.41 | 65.14 |
| Qwen2.5-Coder-3B-Instruct | Self-Play | 81.86 | 67.06 | 45.09 | 64.67 |
| Qwen2.5-Coder-3B-Instruct | Code-A1 | 83.52 | 69.07 | 45.85 | 66.15 |
| Qwen2.5-Coder-7B-Instruct | Base | 83.69 | 71.95 | 49.41 | 68.35 |
| Qwen2.5-Coder-7B-Instruct | Golden Tests | 84.68 | 74.16 | 52.28 | 70.37 |
| Qwen2.5-Coder-7B-Instruct | Self-Play | 84.70 | 74.23 | 52.25 | 70.39 |
| Qwen2.5-Coder-7B-Instruct | Code-A1 | 85.21 | 74.50 | 52.46 | 70.72 |
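The Avg column is consistent with an unweighted mean of the three benchmark scores, rounded to two decimals; a quick sketch to reproduce it from the rows above:

```python
def benchmark_avg(scores):
    """Unweighted mean of benchmark scores, rounded to two decimals as in the table."""
    return round(sum(scores) / len(scores), 2)

# Qwen2.5-Coder-1.5B-Instruct rows: Base and Code-A1.
print(benchmark_avg([63.42, 60.87, 29.34]))  # 51.21
print(benchmark_avg([72.69, 63.33, 34.82]))  # 56.95
```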

Test generation

The Test LLM also improves substantially under adversarial co-evolution. Notably, the 3B Test LLM trained with Code-A1 reaches Mul = 15.29, exceeding the unoptimized 7B base model (14.72).

| Test LLM | Method | pass@5 | mut@5 | Mul |
|---|---|---|---|---|
| Qwen2.5-Coder-1.5B-Instruct | Base | 16.29 | 22.30 | 3.63 |
| Qwen2.5-Coder-1.5B-Instruct | SFT | 14.76 | 29.45 | 4.35 |
| Qwen2.5-Coder-1.5B-Instruct | Self-Play | 23.39 | 28.91 | 6.76 |
| Qwen2.5-Coder-1.5B-Instruct | Code-A1 | 27.05 | 26.41 | 7.14 |
| Qwen2.5-Coder-3B-Instruct | Base | 20.93 | 42.55 | 8.91 |
| Qwen2.5-Coder-3B-Instruct | SFT | 23.51 | 36.29 | 8.53 |
| Qwen2.5-Coder-3B-Instruct | Self-Play | 29.64 | 50.92 | 15.09 |
| Qwen2.5-Coder-3B-Instruct | Code-A1 | 30.86 | 49.56 | 15.29 |
| Qwen2.5-Coder-7B-Instruct | Base | 28.73 | 51.25 | 14.72 |
| Qwen2.5-Coder-7B-Instruct | SFT | 28.72 | 50.85 | 14.60 |
| Qwen2.5-Coder-7B-Instruct | Self-Play | 35.13 | 55.57 | 19.52 |
| Qwen2.5-Coder-7B-Instruct | Code-A1 | 37.15 | 53.14 | 19.74 |

πŸ“„ Citation

@misc{wang2026codea1adversarialevolvingcode,
      title={Code-A1: Adversarial Evolving of Code LLM and Test LLM via Reinforcement Learning}, 
      author={Aozhe Wang and Yuchen Yan and Nan Zhou and Zhengxi Lu and Weiming Lu and Jun Xiao and Yueting Zhuang and Yongliang Shen},
      year={2026},
      eprint={2603.15611},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2603.15611}, 
}

πŸ™ Acknowledgement

The RL training stack is built on top of the excellent verl framework, which is included in this repository under Code-A1/verl. Many thanks to the verl team for open-sourcing the infrastructure that this project extends.
