Environment Setup:
uv sync
In general, the evaluation proceeds as follows:
- Obtain the benchmark datasets for the four tasks.
- Change the LLM API information in mini-swe-agent config files. You can either query an existing endpoint or host one yourself.
- For each benchmark dataset, use mini-swe-agent to work through the instances, i.e., to generate the trajectories.
- Run the corresponding evaluation harness to score the answers parsed from the trajectories.
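The steps above can be sketched as a dry-run loop over the four tasks. This is only an illustration: it assumes that every `eval/<task>.sh` script reads the same environment variables as the `eval/sbv.sh` example given later in this README, which may not hold, so check each script first. `DRY_RUN` and `eval_command` are hypothetical names introduced here, and the default values are copied from that example.

```shell
# Build the command line for one task (sbv, tdd, fea, or qa).
# Assumption: all four scripts accept the variables shown in the sbv.sh example.
eval_command() {
  task="$1"
  printf 'VERSION=%s WORKERS=%s MS=%s CONFIG=eval/%s_host.yaml REPO=%s HASH=%s eval/%s.sh\n' \
    "${VERSION:-0}" "${WORKERS:-6}" "${MS:-qwen34i}" "$task" \
    "${REPO:-django}" "${HASH:-e13b714}" "$task"
}

for task in sbv tdd fea qa; do
  cmd=$(eval_command "$task")
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$cmd"      # default: only print what would be run
  else
    sh -c "$cmd"     # set DRY_RUN=0 to actually launch the script
  fi
done
```

By default the loop only prints the four commands, so you can review them before running anything.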
Benchmark datasets filtered with the knowledge-cutoff protocol (after 2020-12-31):
❯ ls eval/data
fea qa sbv tdd
mini-swe-agent config files for each task:
❯ ls eval | grep yaml
fea_host.yaml
qa_host.yaml
sbv_host.yaml
tdd_host.yaml
You need to change the LLM API information in these config files.
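Before editing, it can help to print the part of each config that holds the model settings. The sketch below assumes the LLM settings live under a top-level `model:` key in each YAML file; that layout is an assumption, so adjust the pattern if your files differ. `show_model_section` is a hypothetical helper, not part of mini-swe-agent.

```shell
# Print the top-level `model:` block of a YAML file.
# Assumption: LLM settings are grouped under a top-level `model:` key.
show_model_section() {
  awk '
    /^model:/ { in_block = 1; print; next }        # block starts here
    in_block && /^[^[:space:]#]/ { in_block = 0 }  # next top-level key ends it
    in_block { print }                             # indented lines belong to the block
  ' "$1"
}

for task in sbv tdd fea qa; do
  cfg="eval/${task}_host.yaml"
  if [ -f "$cfg" ]; then
    echo "== $cfg"
    show_model_section "$cfg"
  else
    echo "missing: $cfg" >&2
  fi
done
```

Run it from the repository root so that the relative `eval/` paths resolve.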
Evaluation scripts for the four tasks:
❯ ls eval | grep sh
sbv.sh # SWE-Bench-Verified
tdd.sh # TDD-Bench-Verified
fea.sh # FEA-Bench
qa.sh # SWE-QA
Take a look at each script to see how to specify its arguments through environment variables, for example:
VERSION=0 WORKERS=6 MS=qwen34i CONFIG=eval/sbv_host.yaml REPO=django HASH=e13b714 eval/sbv.sh
ms-swift is used to perform SFT in our experiments, but you can use other libraries for SFT as well.
To use ms-swift, it is recommended to:
- Create a new Python environment dedicated to ms-swift (not the one used for evaluation).
- Set up ms-swift in that environment.
An example script for training the Django expert model: train/mix_django.sh
- The 4-unit RCX dataset is used for training. For each unit, we sample 2k instances, so the total training dataset is 8k instances. See the argument --dataset in the script.
- The dataset is available at https://huggingface.co/datasets/swespot/sft-v0 . Clone it somewhere, and set the environment variable DATA_DIR to the path of the cloned dataset in the script: export DATA_DIR=/path/to/swespot-sft-v0-hf-repo.
- Similarly, you can train expert models for other repositories.
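A small pre-flight check can catch a wrong dataset path before a training run starts. The sketch below is a separate sanity check, not part of train/mix_django.sh: it assumes you export DATA_DIR in your shell with the same path you put in the script. `check_data_dir`, `UNITS`, and `PER_UNIT` are hypothetical names used only for illustration.

```shell
# Return 0 only if DATA_DIR is set and points to an existing directory.
check_data_dir() {
  if [ -z "${DATA_DIR:-}" ]; then
    echo "DATA_DIR is not set" >&2
    return 1
  fi
  if [ ! -d "$DATA_DIR" ]; then
    echo "DATA_DIR does not point to a directory: $DATA_DIR" >&2
    return 1
  fi
  return 0
}

# Expected size of the training mix: 4 units x 2000 sampled instances each.
UNITS=4
PER_UNIT=2000
echo "expected training instances: $((UNITS * PER_UNIT))"

if check_data_dir; then
  echo "DATA_DIR looks usable: $DATA_DIR"
  # bash train/mix_django.sh   # uncomment to start training
fi
```

The training command is left commented out so that the check can be run on its own.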
Example trained models for the seven repositories selected in the paper can be found on Hugging Face, such as https://huggingface.co/swespot/django-sft-v0 .