I hope this repo can provide a good starting point for continuing this line of work.
To get set up:

- Install browser drivers. This step is platform-dependent; for Firefox you need `geckodriver`. You can verify your installation by running `python src/version_finder.py` to check if all test cases pass.
- Install dependencies:

  ```
  pip install -e .[dev]
  ```

- Install pre-commit hooks. This project uses pre-commit hooks to ensure code quality:

  ```
  pre-commit install
  ```

  The following checks will run automatically on commit:

  - Mypy for type checking
  - Ruff for linting and formatting
- Install submodules:

  ```
  git submodule update --init --recursive
  cd src/swe_bench
  pip install -e .
  ```

  Note: this repository currently uses my fork of `swe-bench` with fixes to the log parsers. You can review all changes on this branch. We've imported limited functionality from SWE-bench, and migrating these functions directly into this project may be beneficial in the future.
- Set `ANTHROPIC_API_KEY` and `GITHUB_TOKEN` in your environment variables (a quick sanity check is sketched after this list).
- Download experiment logs from here. Code for generating figures and analysis is available in `src/report/plot.ipynb`.
- Examining these logs provides valuable insights into model behavior, including retry patterns and failure modes.
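As a quick sanity check after these steps, something like the following (a hypothetical helper, not a script in this repo) can confirm that the browser driver and credentials are in place:

```python
import os
import shutil
import sys


def check_setup() -> int:
    """Verify the browser driver is on PATH and required credentials are set."""
    ok = True
    # geckodriver must be discoverable for the Firefox-based steps above.
    if shutil.which("geckodriver") is None:
        print("geckodriver not found on PATH")
        ok = False
    # Both tokens are read from the environment at runtime.
    for var in ("ANTHROPIC_API_KEY", "GITHUB_TOKEN"):
        if not os.environ.get(var):
            print(f"{var} is not set")
            ok = False
    return 0 if ok else 1


if __name__ == "__main__":
    sys.exit(check_setup())
```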
The main entrypoint is `src/main.py`. Run experiments with:

```
python src/main.py --start-index 0 --num-processes 32 --exp-name report
```
Please remember to adjust the number of processes to your machine's capacity.
Add the `--use-test-lock` flag when running tests for `psf/requests`. This prevents tests from running concurrently, which caused problems in my experiments:

```
python src/main.py --start-index 0 --num-processes 32 --exp-name report/requests --dataset psf/requests --clean-up --use-test-lock
```
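For context, the usual way to serialize test runs across worker processes is an exclusive file lock; the sketch below illustrates the idea (the repo's actual `--use-test-lock` implementation may differ):

```python
import fcntl  # POSIX-only; Windows would need msvcrt.locking instead
from contextlib import contextmanager


@contextmanager
def test_lock(path: str = "/tmp/swe_test.lock"):
    """Hold an exclusive inter-process lock while a test run is in flight."""
    with open(path, "w") as f:
        fcntl.flock(f, fcntl.LOCK_EX)  # blocks until no other worker holds the lock
        try:
            yield
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)


# Usage inside a worker process:
# with test_lock():
#     run_tests()
```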
Bear in mind that the current Docker infrastructure can leave dangling images and other build caches behind, so running these experiments can take up a lot of disk space. I ended up babysitting the experiments reported in the blog post and calling `docker system prune` between runs.
I have described a list of research directions in my blog post; here are a few engineering TODOs.
- The most important item is to use remote code execution so that this data collection process becomes scalable. Based on SWE-bench, Modal is a good candidate (see the sketch after this list).
- The current implementation only supports Claude. Expanding to other LLMs could be a good first issue (a possible interface is sketched below).
- The current implementation only uses file-based logging; integrating with a monitoring platform would be crucial for scaling (see the sketch below).
- Tweaking the prompt structure can enable prompt caching and save costs (see the sketch below).
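For the remote-execution item, here is a minimal sketch of what running a test command on Modal could look like, assuming Modal's Python SDK (`modal.App`, `@app.function`) and with illustrative names throughout:

```python
import modal

app = modal.App("swe-data-collection")
# Illustrative image; a real setup would bake in each task's repo and dependencies.
image = modal.Image.debian_slim().pip_install("pytest")


@app.function(image=image, timeout=600)
def run_tests(command: str) -> str:
    """Run a test command in an isolated remote container and return its output."""
    import subprocess

    result = subprocess.run(command, shell=True, capture_output=True, text=True)
    return result.stdout + result.stderr


@app.local_entrypoint()
def main():
    # Launched with `modal run this_file.py`; fan-out via .map() scales this up.
    print(run_tests.remote("pytest -x"))
```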
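For multi-LLM support, one option is to put a small provider-agnostic interface between the agent loop and the vendor SDKs; the names below (`LLMClient`, `complete`) are hypothetical, not part of the current code:

```python
from typing import Protocol


class LLMClient(Protocol):
    """The minimal surface the agent loop needs from any provider."""

    def complete(self, system: str, user: str) -> str: ...


class ClaudeClient:
    """Adapter for the existing Anthropic path; other providers implement the same interface."""

    def __init__(self, model: str = "claude-3-5-sonnet-latest") -> None:
        import anthropic

        self._client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY
        self._model = model

    def complete(self, system: str, user: str) -> str:
        response = self._client.messages.create(
            model=self._model,
            max_tokens=4096,
            system=system,
            messages=[{"role": "user", "content": user}],
        )
        return response.content[0].text
```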
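For the logging item, one lightweight path is a `logging.Handler` that mirrors records to a monitoring platform alongside the existing file logs; the sketch below uses Weights & Biases purely as an example backend:

```python
import logging

import wandb  # example backend; any experiment-tracking platform works similarly


class MonitoringHandler(logging.Handler):
    """Mirror log records to the monitoring platform alongside the file logs."""

    def emit(self, record: logging.LogRecord) -> None:
        wandb.log({"level": record.levelname, "message": record.getMessage()})


wandb.init(project="swe-data-collection", name="report")
logging.getLogger().addHandler(MonitoringHandler())
```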
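For prompt caching, the key structural change is to put the long, stable context first and mark it with a `cache_control` breakpoint so repeated calls reuse the cached prefix. A minimal sketch with the Anthropic SDK (model name and prompt contents are placeholders):

```python
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-3-5-sonnet-latest",
    max_tokens=1024,
    system=[
        {"type": "text", "text": "You are a coding agent ..."},
        {
            # Large, stable context (e.g. the repo snapshot) goes in the cached prefix.
            "type": "text",
            "text": "<large static repository context>",
            "cache_control": {"type": "ephemeral"},
        },
    ],
    # Only the small, frequently changing part goes after the cache breakpoint.
    messages=[{"role": "user", "content": "Current step instructions ..."}],
)
```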
Many things are borrowed from SWE-bench.