This repo is based on MLAgentBench.
Create a conda environment named after your TASK_NAME.
Then install the MLAgentBench package (and the openai client) with
pip install -e .
pip install openai
Install dependencies with Python 3.10 by running
bash install.sh
(Optional) For Kaggle datasets, you need to set up the Kaggle API and its authentication token (~/.kaggle/kaggle.json) as described here. You may also need to provide manual consent to the rules of specific competitions by following the prompts.
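For reference, the Kaggle download step inside prepare.py might look roughly like the sketch below; it assumes ~/.kaggle/kaggle.json is already in place, and the competition slug and output path are placeholders rather than values used by any existing task:

```python
# Sketch only: download a Kaggle competition dataset from prepare.py.
# Assumes ~/.kaggle/kaggle.json exists and the competition rules were accepted;
# "my-competition" and the output path are placeholders.
import zipfile
from pathlib import Path

from kaggle.api.kaggle_api_extended import KaggleApi


def download_competition_data(slug="my-competition", out_dir="../env/data"):
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    api = KaggleApi()
    api.authenticate()  # reads ~/.kaggle/kaggle.json
    api.competition_download_files(slug, path=str(out))
    with zipfile.ZipFile(out / f"{slug}.zip") as zf:
        zf.extractall(out)


if __name__ == "__main__":
    download_competition_data()
```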
Each task is a folder in MLAgentBench/benchmarks_base/, under which the env/ folder contains the files that the research agent will see at the beginning, and the scripts/ folder contains additional hidden files such as prepare.py for downloading data.
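Concretely, given the commands further below, a newly added task folder will typically look something like this (exact contents vary by task; background.txt is optional and described later):

```
MLAgentBench/benchmarks_base/${TASK_NAME}/
├── env/                    # everything the agent is allowed to see
│   ├── main.py             # entry point: python main.py -m <method> -p dev/test
│   ├── constants.py        # dev-time configuration (replaced by test_constants.py at test time)
│   └── data/               # dev/validation data, populated by prepare.py
└── scripts/                # hidden from the agent
    ├── environment.yml     # conda environment spec for the task
    ├── prepare.py          # downloads datasets and pretrained models
    ├── test_constants.py   # test-time configuration
    ├── test_data/          # held-out test data
    └── background.txt      # optional excerpts from relevant papers
```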
Steps:
- Fork this GitHub repo to your own GitHub space.
- Complete the steps in the Setup section for the MLAgentBench packages.
- Create a new task folder under MLAgentBench/benchmarks_base, following the template.
- Add the runtime and performance of your baseline method in MLAgentBench/constants.py (repeat your run multiple times to ensure consistency; the score should remain relatively stable across runs). A hypothetical example of such an entry is sketched after this list.
- Submit a pull request.
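The exact schema of MLAgentBench/constants.py should be taken from the entries that are already there; the snippet below is only a hypothetical illustration of recording a baseline's runtime and score, and the dictionary names, task name, and numbers are made up:

```python
# Hypothetical illustration only -- mirror the structure that already exists
# in MLAgentBench/constants.py; these names and numbers are placeholders.
ALL_BASE_RUNTIME = {
    "my-new-task": {
        "dev": 1200,    # seconds per baseline run on the validation split
        "test": 1500,   # seconds per baseline run on the test split
    },
}

ALL_BASE_PERFORMANCE = {
    "my-new-task": {
        "dev": 0.742,   # baseline score on the validation split (averaged over runs)
        "test": 0.731,  # baseline score on the test split
    },
}
```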
Here are the commands to test your newly added tasks:
# prepare conda environment and data
cd MLAgentBench/benchmarks_base/${TASK_NAME}/scripts/
conda env create -f environment.yml
conda activate ${TASK_NAME}
# We will install MLAgentBench and openai packages in the newly created conda environment
python prepare.py
# evaluate baseline method on validation set
cd ../env
python main.py -m my_method -p dev
# evaluate baseline method on test set
cp -r ../scripts/test_data/* data/ # prepare test data
cp ../scripts/test_constants.py constants.py # prepare test-time configuration
python main.py -m my_method -p test
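The -m/-p interface used above is what should stay fixed across tasks. A thin argument-parsing layer in env/main.py is enough to support it; the sketch below is only an illustrative outline, and the method and evaluation interfaces in it are placeholders rather than this repo's actual API:

```python
# Illustrative outline of a stable env/main.py CLI. The method module layout
# and the evaluate() import are placeholders, not this repo's actual API.
import argparse
from importlib import import_module


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("-m", "--method", default="my_method",
                        help="name of the method module to run")
    parser.add_argument("-p", "--phase", choices=["dev", "test"], default="dev",
                        help="evaluate on the validation (dev) or test split")
    args = parser.parse_args()

    method = import_module(args.method)          # e.g. my_method.py next to main.py
    predictions = method.run(phase=args.phase)   # placeholder method interface
    from evaluation import evaluate              # read-only metric code (placeholder module)
    score = evaluate(predictions, phase=args.phase)
    print(f"method={args.method} phase={args.phase} score={score:.4f}")


if __name__ == "__main__":
    main()
```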
Also if possible, please include a background.txt file under scripts folder with excerpt from relevant papers or technical reports written by competition participants (besides baseline paper) containing description and core code for relevant methods. See this for an example on llm-merging task. This info will be used to inspire LLM agents for better solutions.
The goal of the refactored code is to meet the following requirements:
This command stays constant:
python main.py -m my_method -p dev/test
Any code that deals with evaluation metrics should be read-only, and you need to make sure that read-only files do not contain anything that is necessary for training or that the agent might need to modify for its implementation; see the sketch below for one way to keep the metric isolated.
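One way to satisfy this is to keep the metric and ground-truth loading in their own file that contains nothing a method implementation would ever need to edit. A minimal, hypothetical example of such a read-only file (the file name, answer path, and metric are placeholders) is:

```python
# evaluation.py -- hypothetical example of a file that can safely be read-only:
# it holds only ground-truth loading and the metric, nothing needed for training
# or for developing new methods.
import json
from pathlib import Path

import numpy as np


def evaluate(predictions, phase="dev"):
    """Score predictions against the ground truth for the given phase (placeholder metric)."""
    labels = json.loads(Path(f"data/answers_{phase}.json").read_text())  # placeholder path
    return float((np.asarray(predictions) == np.asarray(labels)).mean())
```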
Others:
- The LLM agent will be able to “see” all files under the env/ folder, so make sure not to put any test-time information there (including test data and the model name used in the test phase) to avoid the LLM agent “cheating”.
- Put all test data under scripts/test_data.
- Your code should not attempt to access the internet. Any pretrained models or datasets should be downloaded beforehand by prepare.py; a sketch of such a download step follows this list.
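For example, a prepare.py that caches everything ahead of time might look roughly like this; the model name, dataset name, and asset path are arbitrary placeholders rather than assets used by this benchmark:

```python
# Sketch only: pre-download all assets in prepare.py so that code under env/
# never touches the network. "bert-base-uncased", "imdb", and the asset path
# are placeholders.
from pathlib import Path

from datasets import load_dataset
from huggingface_hub import snapshot_download

ASSETS = Path("../env/assets")
ASSETS.mkdir(parents=True, exist_ok=True)

# Pretrained weights, fetched once here and loaded offline by the task code.
snapshot_download("bert-base-uncased", local_dir=str(ASSETS / "bert-base-uncased"))

# Dataset, saved to disk so main.py can read it without internet access.
load_dataset("imdb").save_to_disk(str(ASSETS / "imdb"))
```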
You may use ChatGPT to help you refactor the code and then further tweak the generated code. Feel free to use the template prompt I developed here, which relies on print_all_dir_files.py to give you the concatenation of all files under a specified directory.