monmonli/MLAgentBench

Setup

This repo is based on MLAgentBench.

Create a conda environment named after your TASK_NAME.

Then install the MLAgentBench package with

pip install -e .
pip install openai

Install dependencies with Python 3.10 by running

bash install.sh

(Optional) For Kaggle datasets, you need to set up the Kaggle API and authentication (~/.kaggle/kaggle.json) as described here. You may also need to manually consent to the rules of specific competitions by following the prompts.
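
If it helps, the download step in scripts/prepare.py can use the official kaggle Python package once the credentials file is in place. The sketch below is a hypothetical example, not code from this repo; the competition slug and output path are placeholders.

# Hypothetical snippet for scripts/prepare.py: download a Kaggle competition's data.
# Assumes ~/.kaggle/kaggle.json holds your API credentials and that you have
# accepted the competition's rules on the Kaggle website.
from kaggle.api.kaggle_api_extended import KaggleApi

api = KaggleApi()
api.authenticate()  # reads ~/.kaggle/kaggle.json

# "some-competition" is a placeholder; replace it with the competition's slug.
api.competition_download_files("some-competition", path="data/", quiet=False)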

Tasks

Each task is a folder in MLAgentBench/benchmarks_base/. Within a task folder, the env/ folder contains the files that the research agent sees at the beginning, and the scripts/ folder contains additional hidden files such as prepare.py for downloading data.
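
As a rough illustration (everything except prepare.py is a placeholder name), a task folder is laid out like this:

MLAgentBench/benchmarks_base/${TASK_NAME}/
├── env/        # visible to the agent: starter code, dev data, read-only evaluation code
└── scripts/    # hidden from the agent: prepare.py, test data, test-time configuration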

Instructions for Refactoring:

Steps:

  • Fork this GitHub repo to your own GitHub account.
  • Complete the steps in the Setup section to install the MLAgentBench packages.
  • Create a new task folder under MLAgentBench/benchmarks_base, following the template.
  • Add the runtime and performance of your baseline method to MLAgentBench/constants.py (repeat your run multiple times to ensure consistency; the score should remain relatively stable across runs). See the sketch after this list for an illustrative entry.
  • Submit a pull request.
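
For illustration only, a baseline entry in MLAgentBench/constants.py might look roughly like the sketch below. The dictionary names and fields here are assumptions; follow whatever conventions the existing constants.py already uses.

# Hypothetical baseline entry for MLAgentBench/constants.py; names are assumptions.
ALL_BASE_RUNTIME = {
    "my-task": {"dev": 1200, "test": 1500},  # seconds per run, averaged over repeats
}
ALL_BASE_PERFORMANCE = {
    "my-task": {"dev": 0.75, "test": 0.72},  # baseline score on each phase
}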

Here are the commands to test your newly added tasks:

# prepare conda environment and data
cd MLAgentBench/benchmarks_base/${TASK_NAME}/scripts/
conda env create -f environment.yml
conda activate ${TASK_NAME}
# We will install MLAgentBench and openai packages in the newly created conda environment
python prepare.py

# evaluate baseline method on validation set
cd ../env
python main.py -m my_method -p dev

# evaluate baseline method on test set
cp -r ../scripts/test_data/* data/ # prepare test data
cp ../scripts/test_constants.py constants.py # prepare test-time configuration
python main.py -m my_method -p test

Also, if possible, please include a background.txt file under the scripts/ folder with excerpts from relevant papers or technical reports written by competition participants (besides the baseline paper), containing descriptions and core code for relevant methods. See this for an example on the llm-merging task. This information will be used to inspire LLM agents toward better solutions.

The refactored code should meet the following requirements:

The entry-point command stays constant: python main.py -m my_method -p dev/test. Any code that deals with evaluation metrics should be read-only, and read-only files must not contain anything that is necessary for training or that the agent might need to modify for its own implementation.
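
As a non-authoritative sketch of that interface, env/main.py can parse the two flags as shown below. The run_method and evaluate helpers are placeholders for your task's own training and (read-only) evaluation code, not functions provided by MLAgentBench.

# Minimal sketch of the expected env/main.py command-line interface.
import argparse

def run_method(method_name, phase):
    # Placeholder: the task's training/inference code (modifiable by the agent) goes here.
    return []

def evaluate(predictions, phase):
    # Placeholder: the task's read-only metric computation goes here.
    return 0.0

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("-m", "--method", default="my_method",
                        help="name of the method to run")
    parser.add_argument("-p", "--phase", choices=["dev", "test"], default="dev",
                        help="which split to evaluate on")
    args = parser.parse_args()

    predictions = run_method(args.method, phase=args.phase)
    score = evaluate(predictions, phase=args.phase)
    print(f"{args.method} on {args.phase}: {score}")

if __name__ == "__main__":
    main()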

Other requirements:

  • The LLM agent will be able to “see” all files under the env/ folder, so make sure not to put any test-time information (including test data and the model names used in the test phase) there, to prevent the LLM agent from “cheating”.
  • Put all test data under scripts/test_data instead.
  • Your code should not attempt to access the internet. Any pretrained models or datasets should be downloaded beforehand by prepare.py (see the sketch after this list).
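
As an example of the offline rule, scripts/prepare.py is the one place where downloads happen. The sketch below is hypothetical; the URLs and file names are placeholders.

# Hypothetical scripts/prepare.py: fetch everything ahead of time so that code
# under env/ never needs internet access. URLs and file names are placeholders.
import urllib.request
from pathlib import Path

SCRIPTS_DIR = Path(__file__).resolve().parent
DATA_DIR = SCRIPTS_DIR.parent / "env" / "data"  # dev data, visible to the agent
TEST_DIR = SCRIPTS_DIR / "test_data"            # test data, hidden from the agent
DATA_DIR.mkdir(parents=True, exist_ok=True)
TEST_DIR.mkdir(parents=True, exist_ok=True)

urllib.request.urlretrieve("https://example.com/dev_data.zip", str(DATA_DIR / "dev_data.zip"))
urllib.request.urlretrieve("https://example.com/test_data.zip", str(TEST_DIR / "test_data.zip"))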

Pro tips

You may use ChatGPT to help you refactor the code and then tweak the generated code further. Feel free to use the template prompt I developed here, which relies on print_all_dir_files.py, a script that prints the concatenation of all files under a specified directory.
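
The actual helper in the repo may differ, but a minimal version of such a script could look like this sketch:

# Minimal sketch of a print_all_dir_files.py-style helper: walk a directory and
# print each file's path followed by its contents, for pasting into a prompt.
# This is an illustration; the script shipped with the repo may differ.
import os
import sys

def print_all_files(root: str) -> None:
    for dirpath, _, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            print(f"===== {path} =====")
            try:
                with open(path, "r", encoding="utf-8") as f:
                    print(f.read())
            except UnicodeDecodeError:
                print("[binary file skipped]")

if __name__ == "__main__":
    print_all_files(sys.argv[1] if len(sys.argv) > 1 else ".")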
