
CodeUltraFeedback

Aligning Large Language Models to Coding Preferences

🤔 About • 🚀 Getting Started • 🧠 Models • 🤗 Datasets • 📝 Citation

Note

[03-17-2024] 🔥 We updated our code to support Claude-3 models for grading. CODAL-Bench now includes claude-3-sonnet-20240229 responses.

[03-13-2024] 🏆 We are preparing a leaderboard for CODAL-Bench; stay tuned!

[03-13-2024] 🔥 We release the first version of CodeUltraFeedback and CODAL-Bench.

Contact: If you have any inquiries or want to raise an issue, please feel free to contact the authors.

🤔 About

Figure: Overview of CodeUltraFeedback dataset construction (see Section II of our paper for more details).

Given the increasing coding capabilities of large language models (LLMs), the following question emerges:

How well do these capabilities align with the expectations of developers, particularly concerning non-functional requirements such as code readability, efficiency, and adherence to best practices?

We believe existing benchmarks that rely on automated metrics and static analysis tools are too rigid to evaluate the broader capabilities of LLMs. LLM-as-a-judge offers a more nuanced strategy, serving as a proxy for human evaluation, while effectively accounting for the intricacies of natural and programming languages.

Our work features two main contributions: CodeUltraFeedback, a dataset for aligning LLMs to coding preferences, and CODAL-Bench, a benchmark for evaluating that alignment using LLM-as-a-judge.

CodeUltraFeedback is a preference dataset of complex coding instructions for aligning LLMs to coding preferences. Its construction procedure is analogous to UltraFeedback's, featuring:

  • Complex instructions: CodeUltraFeedback is based on a 10k subset of MagiCoder Evol-Instruct comprising open-domain, complex coding instructions.
  • Coding preferences: CodeUltraFeedback covers 5 coding preferences that are crucial for evaluating the broader capabilities of LLMs: instruction-following, code explanation, code complexity and efficiency, code readability, and coding style.
  • Large pool of LLMs: We use a large pool of 14 LLMs from 8 model families to generate responses to the 10k instructions, capturing diverse writing and coding styles.
  • LLM-as-a-judge and AI feedback: We use GPT-3.5 as a judge to evaluate LLM responses, annotating each response with both numerical and textual feedback. The AI feedback data can be leveraged for various applications, including model alignment through RLAIF, tuning a critic LLM, and more (see the loading sketch after this list).
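
For a quick look at the AI feedback data, the dataset can be loaded with the Hugging Face `datasets` library. This is a minimal sketch, assuming the hub ID used below; follow the 🤗 Datasets link above for the exact location.

```python
from datasets import load_dataset

# Hub ID assumed for illustration; check the Datasets section for the exact link.
ds = load_dataset("coseal/CodeUltraFeedback", split="train")

# Inspect one record: each instruction comes with responses from several LLMs,
# annotated by the judge with numerical ratings and textual feedback.
example = ds[0]
print(example.keys())
```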

CODAL-Bench is a benchmark of 500 coding problems (100 per coding preference). We evaluate LLM alignment via LLM-as-a-judge with reference-guided single-answer grading, using GPT-3.5 or GPT-4 as the judge. This approach lets the judge LLM provide consistent ratings and evaluate each LLM individually (similar to MT-Bench).
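
Below is a minimal sketch of reference-guided single-answer grading with the OpenAI Python client. The prompt template and `grade` helper are illustrative assumptions, not the exact template shipped in this repository; see the evaluation code for the real one.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Illustrative template (assumption): the real CODAL-Bench template lives
# in this repository's evaluation code.
JUDGE_TEMPLATE = """[Instruction]
{instruction}

[Reference Answer]
{reference}

[Assistant's Answer]
{answer}

Rate the assistant's answer against the reference for the target coding
preference on a 1-10 scale, then briefly justify the rating."""

def grade(instruction: str, reference: str, answer: str, judge: str = "gpt-4") -> str:
    """Hypothetical helper returning the judge's rating and justification."""
    completion = client.chat.completions.create(
        model=judge,
        temperature=0,  # low temperature for more consistent ratings
        messages=[{"role": "user", "content": JUDGE_TEMPLATE.format(
            instruction=instruction, reference=reference, answer=answer)}],
    )
    return completion.choices[0].message.content
```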

🚀 Getting Started

We provide all the source code used to build CodeUltraFeedback and evaluate LLMs on CODAL-Bench.

Important

We are currently working on instructions to:

  1. Build CodeUltraFeedback or extend the dataset
  2. Tune your own SFT and DPO LLMs (see the sketch after this list)
  3. Evaluate LLMs on CODAL-Bench
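
As a starting point for step 2, here is a minimal sketch of preference tuning with DPO using the `trl` library, under stated assumptions: the toy preference pairs stand in for pairs derived from CodeUltraFeedback ratings, and depending on your `trl` version the tokenizer may need to be passed as `tokenizer=` rather than `processing_class=`.

```python
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_id = "codellama/CodeLlama-7b-Instruct-hf"
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Toy preference pairs; in practice, pair a highly rated response ("chosen")
# with a lower rated one ("rejected") for the same instruction, using the
# GPT-3.5 ratings from CodeUltraFeedback.
pairs = Dataset.from_dict({
    "prompt": ["Write a Python function that checks if a string is a palindrome."],
    "chosen": ["def is_palindrome(s: str) -> bool:\n    return s == s[::-1]"],
    "rejected": ["def f(x):\n    return x == x[::-1]"],
})

config = DPOConfig(output_dir="codellama-7b-instruct-dpo", beta=0.1)
trainer = DPOTrainer(model=model, args=config, train_dataset=pairs, processing_class=tokenizer)
trainer.train()
```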

🧠 Models

| Model | Checkpoint | Size | CODAL-Bench GPT-3.5 (G-3.5 / G-4) | CODAL-Bench GPT-4 (G-4) | HumanEval+ (k=1 / k=10) | License |
|---|---|---|---|---|---|---|
| CodeLlama-7B-Instruct | 🤗 HF Link | 7B | 6.00 / 5.46 | 4.72 | 37.9 / 60.4 | Llama2 |
| CodeLlama-7B-Instruct-SFT | 🤗 HF Link | 7B | 6.51 / 5.83 | 5.84 | 51.2 / 82.9 | Llama2 |
| CodeLlama-7B-Instruct-DPO | 🤗 HF Link | 7B | 7.15 / 6.79 | 5.08 | 42.3 / 80.5 | Llama2 |
| CodeLlama-7B-Instruct-SFT+DPO | 🤗 HF Link | 7B | 7.36 / 7.08 | 5.85 | 43.1 / 75.6 | Llama2 |
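
The tuned checkpoints can be loaded with `transformers` like any causal LM. A minimal sketch; the hub ID below is a hypothetical placeholder, so follow the 🤗 HF links in the table for the exact names.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "martin-wey/CodeLlama-7B-Instruct-SFT-DPO"  # hypothetical ID; see table links
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "Write a readable Python function that merges two sorted lists."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```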

🤗 Datasets and Benchmark

📝 Citation

```bibtex
@misc{weyssow2024codeultrafeedback,
  title={CodeUltraFeedback: An LLM-as-a-Judge Dataset for Aligning Large Language Models to Coding Preferences},
  author={Martin Weyssow and Aton Kamanda and Houari Sahraoui},
  year={2024},
  eprint={2403.09032},
  archivePrefix={arXiv},
  primaryClass={cs.SE}
}
```