Quick Start • LLM code • Papers • Tools • Development • Acknowledgement
**Warning**

Evaluating LLM-generated code over datasets with "3 test-cases" is **NOT** enough!
To address this, we started the EvalPlus project -- a rigorous evaluation framework for LLM4Code that:
- improves code benchmarks by adding up to thousands of new tests! (81x new tests for HumanEval!)
- crafts a set of utility tools to sanitize, visualize and inspect LLM-generated code and evaluation results!
- accelerates LLM4Code research by open-sourcing LLM-generated samples for 14+ models -- no need to re-run the expensive benchmarks!
To get started, please first set up the environment:
```bash
pip install evalplus --upgrade
```

...Or you can try out the latest development version:

```bash
pip install "git+https://github.com/evalplus/evalplus.git" --upgrade
```

Want to use a local GitHub repo? :: click to expand ::
```bash
git clone https://github.com/evalplus/evalplus.git
cd evalplus
export PYTHONPATH=$PYTHONPATH:$(pwd)
pip install -r requirements.txt
```

The usage is just like the original HumanEval where you just need to implement the `generate_one_completion` function!
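For illustration, here is one minimal way `generate_one_completion` could be implemented with a Hugging Face causal LM; the model id, decoding settings, and prompt-stripping below are placeholder assumptions, not part of EvalPlus:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model id -- swap in whichever code LLM you are evaluating.
tokenizer = AutoTokenizer.from_pretrained("your-org/your-code-model")
model = AutoModelForCausalLM.from_pretrained("your-org/your-code-model")

def generate_one_completion(prompt: str) -> str:
    """Generate a single completion for one HumanEval+ prompt."""
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.8)
    text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    # Naive prompt stripping; good enough for a sketch, though tokenization
    # round-trips may not reproduce the prompt exactly.
    return text[len(prompt):]
```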
```python
from evalplus.data import get_human_eval_plus, write_jsonl

problems = get_human_eval_plus()
num_samples_per_task = 200
samples = [
    dict(task_id=task_id, completion=generate_one_completion(problems[task_id]["prompt"]))
    for task_id in problems
    for _ in range(num_samples_per_task)
]
write_jsonl("samples.jsonl", samples)
```

What is in a `problem`? :: click to expand ::
- `task_id` is the identifier string for the task
- `entry_point` is the name of the function
- `prompt` is the function signature with docstring
- `canonical_solution` is the ground-truth implementation (re-implemented to fix bugs in HumanEval)
- `base_input` is the test inputs from the original HumanEval
- `plus_input` is the test inputs brought by EvalPlus
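For example, you can peek at a single task like this (the task id `HumanEval/0` follows the original HumanEval naming and is only used here for illustration):

```python
from evalplus.data import get_human_eval_plus

problems = get_human_eval_plus()
task = problems["HumanEval/0"]  # illustrative task id

print(task["entry_point"])       # name of the function to implement
print(task["prompt"])            # signature + docstring handed to the LLM
print(len(task["base_input"]))   # number of original HumanEval test inputs
print(len(task["plus_input"]))   # number of extra test inputs added by EvalPlus
```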
To evaluate the samples:
We strongly recommend using a sandbox such as Docker:
```bash
docker run -v $(pwd):/app ganler/evalplus:v0.1.1 --dataset humaneval --samples samples.jsonl
```

...Or if you want to try it locally regardless of the risks:

```bash
evalplus.evaluate --dataset humaneval --samples samples.jsonl
```

Try out HumanEvalPlus-Mini, which selects a minimal set of additional tests with the highest quality, achieving almost the same effectiveness as the full version. Just add a `--mini` flag; it can run 23+% faster! (even faster if you evaluate all tests regardless of fail-stop).
```bash
docker run -v $(pwd):/app ganler/evalplus:v0.1.1 --dataset humaneval --samples samples.jsonl --mini
# ...Or locally
# evalplus.evaluate --dataset humaneval --samples samples.jsonl --mini
```

Want to use a local GitHub repo? :: click to expand ::

```bash
python evalplus/evaluate.py --dataset humaneval --samples samples.jsonl
```

More command-line flags :: click to expand ::
- `--parallel`: by default half of the cores
- `--base-only` (store_true): only run base HumanEval tests
- `--i-just-wanna-run`: force a re-run
How long would it take? :: click to expand ::
When running 200 samples x 164 tasks x ~775 tests, it can take around 4-8 minutes using `--parallel 64` and `--test-details`.
Here are some tips to speed up the evaluation:
- Use `--parallel $(nproc)`
- Do not use `--test-details` if you just want to quickly get pass@k, as `--test-details` will run all tests (~775 on average for each task), while without `--test-details` the testing for a sample stops immediately when it fails the first test (see the sketch after this list).
- Use our pre-evaluated results (see LLM-generated code)
- We will release a distilled version of HumanEval+ soon. Stay tuned!
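For intuition, here is a schematic of the difference between the default fail-stop checking and the exhaustive checking that `--test-details` pays for; this is a conceptual sketch, not EvalPlus's actual implementation:

```python
from typing import Callable, List, Sequence

def fail_stop_pass(run_test: Callable[[object], bool], inputs: Sequence[object]) -> bool:
    """Default mode: stop at the first failing test input."""
    for x in inputs:
        if not run_test(x):
            return False  # remaining tests for this sample are skipped
    return True

def detailed_results(run_test: Callable[[object], bool], inputs: Sequence[object]) -> List[bool]:
    """--test-details mode: run every test input and record each outcome."""
    return [run_test(x) for x in inputs]
```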
The output should look like the following (below is a GPT-4 greedy decoding example):
```
Computing expected output...
Expected outputs computed in 15.18s
Reading samples...
164it [00:04, 37.79it/s]
Evaluating samples...
100%|██████████████████████████████████████████| 164/164 [00:03<00:00, 44.75it/s]
Base
{'pass@1': 0.8841463414634146}
Base + Extra
{'pass@1': 0.75}
```
- `Base` is the pass@k for the original HumanEval (see the pass@k estimator sketch after this list)
- `Base + Extra` is the pass@k for our HumanEval+ (with extra tests)
- The "k" includes `[1, 10, 100]`, where only k values `<=` the sample size are used
- A cache file named like `samples_eval_results.jsonl` will be saved. Remove it to re-run the evaluation.
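For reference, pass@k here is the standard unbiased estimator introduced with HumanEval, typically averaged over tasks; a minimal sketch of the per-task computation:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one task: n samples drawn, c of them correct."""
    if n - c < k:
        return 1.0  # any size-k subset is guaranteed to contain a correct sample
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))
```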
Please find the pre-generated LLM code samples in the attachment of our v0.1.0 release.
Each sample file is packaged in a zip file named like `${model_name}_temp_${temperature}.zip`.
You can unzip them to a folder named like `${model_name}_temp_${temperature}` and run the evaluation from scratch with:
```bash
evalplus.evaluate --dataset humaneval --samples ${model_name}_temp_${temperature}
```

Read our paper for more detailed findings!
```bibtex
@article{evalplus,
  title={Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation},
  author={Jiawei Liu and Chunqiu Steven Xia and Yuyao Wang and Lingming Zhang},
  journal={arXiv preprint arXiv:2305.01210},
  year={2023},
}
```

To use these tools, please first install the repository from GitHub:
```bash
git clone https://github.com/evalplus/evalplus.git
cd evalplus
pip install -r requirements-tools.txt
```

Check LLM-produced code and answer the following questions:
- Is the generation entirely done for all samples / all problems in the dataset?
- Is the LLM-generated code compilable? (if not, something could be wrong and you'd better check)
```bash
python tools/checker.py --folder /path/to/[model]-[??]b_temp_[??] --dataset humaneval
```

LLM-generated code may contain syntax errors, but some of them can be easily fixed with simple post-processing. This tool makes LLM-generated code cleaner and more compilable by applying post-processing steps such as trimming at additional "magic" EOF markers and removing garbage non-code tokens.
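For intuition, the EOF-based trimming can be thought of as cutting each completion at the earliest occurrence of an "end of solution" marker; the marker list below is illustrative only and not the tool's actual configuration:

```python
# Illustrative markers -- not the sanitizer's real list.
EOF_MARKERS = ["\nif __name__", "\ndef test_", "\nassert ", "\nprint(", "```"]

def trim_completion(completion: str) -> str:
    """Keep only the code before the earliest EOF-like marker."""
    cut = len(completion)
    for marker in EOF_MARKERS:
        idx = completion.find(marker)
        if idx != -1:
            cut = min(cut, idx)
    return completion[:cut]
```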
```bash
python tools/sanitize.py --eof --folder /path/to/vicuna-[??]b_temp_[??]
# Sanitized code will be produced to `/path/to/vicuna-[??]b_temp_[??]-sanitized`
```

To render the results:

```bash
python tools/render.py --type /path/to/[model]-[??]b # NOTE: no `_temp_[??]`
```

Before you start:
```bash
pip install pre-commit
pre-commit install
export PYTHONPATH=$PYTHONPATH:$(pwd)
```

- `evalplus` is the package name.
- `${DATASET}_plus` is the name of a dataset applied with EvalPlus.