
SWE Bench Infinite: Scaling RL Environments for SWE Agents


I hope this repo can provide a good starting point for continuing this line of work.

Development Setup

This project uses pre-commit hooks to ensure code quality. Set up a development environment as follows:

  1. Install browser drivers. This step is platform-dependent; see the sketch after this list for one way to install them.

    firefox
    geckodriver
    

    You can verify your installation by running python src/version_finder.py to check if all test cases pass.

  2. Install dependencies:

    pip install -e .[dev]
    
  3. Install pre-commit hooks:

    pre-commit install
    

    The following checks will run automatically on commit:

    • Mypy for type checking
    • Ruff for linting and formatting
  4. Install submodules:

    git submodule update --init --recursive
    cd src/swe_bench
    pip install -e .
    

    Note: this repository currently uses my fork of SWE-bench with fixes to the log parsers; you can review all changes on this branch. Only limited functionality is imported from SWE-bench, so migrating those functions directly into this project may be worthwhile in the future.

  5. Please set the ANTHROPIC_API_KEY and GITHUB_TOKEN environment variables.
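
    For example, in a POSIX shell (placeholder values):

    export ANTHROPIC_API_KEY=your-anthropic-key
    export GITHUB_TOKEN=your-github-token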
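
As noted in step 1, driver installation is platform-dependent. One possible sketch (package names are assumptions; adjust for your platform):

# macOS (Homebrew)
brew install --cask firefox
brew install geckodriver

# Debian/Ubuntu: install Firefox via apt, then download a geckodriver
# binary from https://github.com/mozilla/geckodriver/releases
sudo apt-get install firefox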

Experiments

Replicating my results

  1. Download the experiment logs from here. Code for generating the figures and analysis is available in src/report/plot.ipynb; see the command after this list to open it.

  2. Examining these logs provides valuable insights into model behavior, including retry patterns and failure modes.
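
To open the analysis notebook locally (assuming Jupyter is installed in your environment):

jupyter notebook src/report/plot.ipynb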

Running the experiments

The main entrypoint is src/main.py. Run experiments with:

python src/main.py --start-index 0 --num-processes 32 --exp-name report

Please remember to adjust the number of processes to your machine's capacity.

You should add the --use-test-lock flag when running experiments on requests. This prevents tests from running concurrently, which caused problems in my experiments.

python src/main.py --start-index 0 --num-processes 32 --exp-name report/requests --dataset psf/requests --clean-up --use-test-lock

Please bear in mind that the current Docker infrastructure can leave behind dangling images and other build caches, so running these experiments can take up a lot of disk space. I ended up babysitting the experiments reported in the blogpost and calling docker system prune between runs.
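
For example, a cleanup pass between runs (-f skips the confirmation prompt):

docker system prune -f     # remove stopped containers, dangling images, and unused networks
docker builder prune -f    # clear the Docker build cache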

Contributing

I have described a list of research directions in my blogpost; here are a few engineering todos.

  1. The most important item is to use remote code execution so that this data collection process becomes scalable. Following SWE-bench, Modal is a good candidate; see the sketch after this list.

  2. The current implementation only supports Claude. Expanding to other LLMs would be a good first issue.

  3. The current implementation only uses file-based logging; integrating with a monitoring platform would be crucial for scaling.

  4. Tweaking the prompt structure could enable prompt caching and reduce costs.
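
As a sketch of item 1, here is roughly what a Modal-based remote executor could look like. This is an illustration, not part of the repository: the app name, image contents, and the run_tests function are all assumptions.

import modal

app = modal.App("swe-bench-infinite")  # hypothetical app name
# Hypothetical image: a slim base with git and pytest available.
image = modal.Image.debian_slim().apt_install("git").pip_install("pytest")

@app.function(image=image, timeout=600)
def run_tests(repo_url: str, commit: str) -> str:
    """Clone a repository at a given commit and run its test suite remotely."""
    import subprocess
    subprocess.run(["git", "clone", repo_url, "/repo"], check=True)
    subprocess.run(["git", "-C", "/repo", "checkout", commit], check=True)
    result = subprocess.run(
        ["python", "-m", "pytest", "/repo"],
        capture_output=True, text=True,
    )
    return result.stdout

@app.local_entrypoint()
def main():
    # Each .remote() call runs in its own container, so calls can fan out in parallel.
    print(run_tests.remote("https://github.com/psf/requests", "main"))

Running such a script with modal run would execute each call in an isolated cloud container, which also sidesteps the local Docker disk-space issues mentioned above.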

Acknowledgements

Many things are borrowed from SWE-bench.
