I hope this repo can provide a good starting point for continuing this line of work.
To get set up:

- Install browser drivers. This step is platform-dependent; for Firefox you need `geckodriver`. You can verify your installation by running `python src/version_finder.py` to check if all test cases pass.
- Install dependencies:

  ```
  pip install -e .[dev]
  ```

- Install pre-commit hooks. This project uses pre-commit hooks to ensure code quality:

  ```
  pre-commit install
  ```

  The following checks will run automatically on commit:

  - Mypy for type checking
  - Ruff for linting and formatting
- Install submodules:

  ```
  git submodule update --init --recursive
  cd src/swe_bench
  pip install -e .
  ```

  Note: this repository currently uses my fork of `swe-bench` with fixes to the log parsers. You can review all changes on this branch. We've imported limited functionality from SWE-bench, and migrating these functions directly into this project may be beneficial in the future.
- Set `ANTHROPIC_API_KEY` and `GITHUB_TOKEN` in your environment variables (a quick sanity check is sketched after this list).
- Download experiment logs from here. Code for generating figures and analysis is available in `src/report/plot.ipynb`.
- Examining these logs provides valuable insights into model behavior, including retry patterns and failure modes.
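As a quick sanity check after these steps, something like the following (a hypothetical helper, not a script in this repo) can confirm that the browser driver and credentials are in place:

```python
import os
import shutil
import sys


def check_setup() -> int:
    """Verify the browser driver is on PATH and required credentials are set."""
    ok = True
    # geckodriver must be discoverable for the Firefox-based steps above.
    if shutil.which("geckodriver") is None:
        print("geckodriver not found on PATH")
        ok = False
    # Both tokens are read from the environment at runtime.
    for var in ("ANTHROPIC_API_KEY", "GITHUB_TOKEN"):
        if not os.environ.get(var):
            print(f"{var} is not set")
            ok = False
    return 0 if ok else 1


if __name__ == "__main__":
    sys.exit(check_setup())
```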
The main entrypoint is `src/main.py`. Run experiments with:

```
python src/main.py --start-index 0 --num-processes 32 --exp-name report
```
Please remember to adjust the number of processes to your machine's capacity.
Add the `--use-test-lock` flag when running tests for `psf/requests`. This prevents tests from running concurrently, which caused problems in my experiments:

```
python src/main.py --start-index 0 --num-processes 32 --exp-name report/requests --dataset psf/requests --clean-up --use-test-lock
```
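For context, the usual way to serialize test runs across worker processes is an exclusive file lock; the sketch below illustrates the idea (the repo's actual `--use-test-lock` implementation may differ):

```python
import fcntl  # POSIX-only; Windows would need msvcrt.locking instead
from contextlib import contextmanager


@contextmanager
def test_lock(path: str = "/tmp/swe_test.lock"):
    """Hold an exclusive inter-process lock while a test run is in flight."""
    with open(path, "w") as f:
        fcntl.flock(f, fcntl.LOCK_EX)  # blocks until no other worker holds the lock
        try:
            yield
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)


# Usage inside a worker process:
# with test_lock():
#     run_tests()
```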
Bear in mind that the current Docker infrastructure can leave dangling images and other build caches behind, so running these experiments can take up a lot of disk space. I ended up babysitting the experiments reported in the blog post and calling `docker system prune` between runs.
I have described a list of research directions in my blog post; here are a few engineering TODOs.
- The most important item is to use remote code execution so that this data collection process becomes scalable. Based on SWE-bench, Modal is a good candidate (see the sketch after this list).
- The current implementation only supports Claude. Expanding to other LLMs could be a good first issue (a possible interface is sketched below).
- The current implementation only uses file-based logging; integrating with a monitoring platform would be crucial for scaling (see the sketch below).
- Tweaking the prompt structure can enable prompt caching and save costs (see the sketch below).
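For the remote-execution item, here is a minimal sketch of what running a test command on Modal could look like, assuming Modal's Python SDK (`modal.App`, `@app.function`) and with illustrative names throughout:

```python
import modal

app = modal.App("swe-data-collection")
# Illustrative image; a real setup would bake in each task's repo and dependencies.
image = modal.Image.debian_slim().pip_install("pytest")


@app.function(image=image, timeout=600)
def run_tests(command: str) -> str:
    """Run a test command in an isolated remote container and return its output."""
    import subprocess

    result = subprocess.run(command, shell=True, capture_output=True, text=True)
    return result.stdout + result.stderr


@app.local_entrypoint()
def main():
    # Launched with `modal run this_file.py`; fan-out via .map() scales this up.
    print(run_tests.remote("pytest -x"))
```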
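For multi-LLM support, one option is to put a small provider-agnostic interface between the agent loop and the vendor SDKs; the names below (`LLMClient`, `complete`) are hypothetical, not part of the current code:

```python
from typing import Protocol


class LLMClient(Protocol):
    """The minimal surface the agent loop needs from any provider."""

    def complete(self, system: str, user: str) -> str: ...


class ClaudeClient:
    """Adapter for the existing Anthropic path; other providers implement the same interface."""

    def __init__(self, model: str = "claude-3-5-sonnet-latest") -> None:
        import anthropic

        self._client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY
        self._model = model

    def complete(self, system: str, user: str) -> str:
        response = self._client.messages.create(
            model=self._model,
            max_tokens=4096,
            system=system,
            messages=[{"role": "user", "content": user}],
        )
        return response.content[0].text
```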
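For the logging item, one lightweight path is a `logging.Handler` that mirrors records to a monitoring platform alongside the existing file logs; the sketch below uses Weights & Biases purely as an example backend:

```python
import logging

import wandb  # example backend; any experiment-tracking platform works similarly


class MonitoringHandler(logging.Handler):
    """Mirror log records to the monitoring platform alongside the file logs."""

    def emit(self, record: logging.LogRecord) -> None:
        wandb.log({"level": record.levelname, "message": record.getMessage()})


wandb.init(project="swe-data-collection", name="report")
logging.getLogger().addHandler(MonitoringHandler())
```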
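For prompt caching, the key structural change is to put the long, stable context first and mark it with a `cache_control` breakpoint so repeated calls reuse the cached prefix. A minimal sketch with the Anthropic SDK (model name and prompt contents are placeholders):

```python
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-3-5-sonnet-latest",
    max_tokens=1024,
    system=[
        {"type": "text", "text": "You are a coding agent ..."},
        {
            # Large, stable context (e.g. the repo snapshot) goes in the cached prefix.
            "type": "text",
            "text": "<large static repository context>",
            "cache_control": {"type": "ephemeral"},
        },
    ],
    # Only the small, frequently changing part goes after the cache breakpoint.
    messages=[{"role": "user", "content": "Current step instructions ..."}],
)
```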
Many things are borrowed from SWE-bench.