This repo contains IFBench, a new, challenging benchmark for precise instruction following. For details, see the IFBench paper, accepted to NeurIPS 2025 (Datasets & Benchmarks track).
IFBench consists of two parts, plus an optional multi-turn evaluation setting:

- OOD Constraints: 58 new and challenging constraints, each with a corresponding verification function (a toy verifier is sketched after this list). The constraint templates are combined with prompts from a held-out set of WildChat (Zhao et al. 2024).
- (optional) Multi-turn Constraint Isolation in 2 turns: the prompt and the constraint are separated over two turns, i.e. the first turn contains the user prompt and the model's response to it, and the second turn adds the constraint that modifies the initial prompt.
- New IF-RLVR training constraints: 29 new and challenging constraints, with corresponding verification functions.
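To make the verification-function idea concrete, here is a minimal sketch of what such a checker might look like. The constraint, function name, and signature are hypothetical illustrations, not the repo's actual API; the real constraints and verifiers live in the IFEvalG module linked below.

```python
import re

# Hypothetical example: a verifier for the constraint
# "every sentence must contain the word 'data'".
# Illustrative only; see the repo's verification functions for the real checks.
def verify_keyword_in_every_sentence(response: str, keyword: str = "data") -> bool:
    """Return True iff every sentence in the response contains the keyword."""
    # Naive sentence split on ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", response.strip()) if s]
    if not sentences:
        return False
    return all(keyword.lower() in s.lower() for s in sentences)
```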
Install the requirements via the requirements.txt file. You need two jsonl files: the IFBench_test.jsonl file (in the data folder) and a file with your eval prompts and completions (see sample_output.jsonl for an example). Then run:
```bash
python3 -m run_eval --input_data=IFBench_test.jsonl --input_response_data=sample_output.jsonl --output_dir=eval
```
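For orientation, here is a minimal sketch of what one line of the response file might contain. The "prompt"/"response" field names are an assumption based on the IFEval-style schema; consult sample_output.jsonl for the authoritative keys.

```python
import json

# Hypothetical single line of the response jsonl; the field names are assumed
# from the IFEval-style schema -- check sample_output.jsonl for the exact keys
# expected by run_eval.
line = {
    "prompt": "Write a short story about a robot. End your reply with the word 'done'.",
    "response": "Once upon a time, a robot learned to paint. done",
}
print(json.dumps(line))
```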
Note: In the paper we generally report the prompt-level loose accuracy on IFBench. When generating for evaluation, we use a temperature of 0 and adjust the maximum number of generated tokens depending on the model type, i.e. thinking models are allowed to generate more tokens, and we then post-process the output to extract the answer without the reasoning chains.
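As an illustration of that post-processing step, here is a minimal sketch that strips a reasoning chain before scoring, assuming the model wraps its reasoning in `<think>...</think>` tags (the delimiter varies by model and is an assumption here, not the repo's actual extraction logic):

```python
import re

def strip_reasoning(output: str) -> str:
    """Remove <think>...</think> reasoning chains, keeping only the final answer.

    Assumes the model delimits its reasoning with <think> tags; other models
    may use different markers, so adapt the pattern accordingly.
    """
    answer = re.sub(r"<think>.*?</think>", "", output, flags=re.DOTALL)
    return answer.strip()
```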
You can find our released datasets in this collection, which contains the test data, the multi-turn test data and the IF-RLVR training data.
We also release our IF-RLVR code as part of open-instruct. You can run this GRPO script using our training data. This is an example command.
The new training constraints and verification functions are here: https://github.com/allenai/open-instruct/tree/main/open_instruct/IFEvalG
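The verification functions double as reward signals during IF-RLVR training: a completion receives a binary reward depending on whether it satisfies the sampled constraint. The sketch below illustrates this idea using the hypothetical verifier from earlier; the actual reward wiring lives in the open-instruct GRPO script.

```python
from typing import Callable

# Hypothetical IF-RLVR reward: 1.0 if the completion satisfies the constraint's
# verification function, else 0.0. Illustrative only; see open-instruct for the
# real GRPO integration.
def constraint_reward(completion: str, verify_fn: Callable[[str], bool]) -> float:
    return 1.0 if verify_fn(completion) else 0.0

# Example usage with the toy verifier sketched earlier:
# reward = constraint_reward(model_output,
#                            lambda s: verify_keyword_in_every_sentence(s, "data"))
```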
| Rank | Model | IFBench Score | IFEval Score |
|---|---|---|---|
| 🥇 1 | OpenAI o3 | 69.3 | 95.0 |
| 🥈 2 | Qwen2.5 Base + IF-RLVR | 53.7 | 87.8 |
| 🥉 3 | Llama 3.1 Base + IF-RLVR | 52.7 | 88.2 |
| 4 | Gemini 2.5 Pro | 52.3 | 65.4 |
| 5 | Qwen 2.5 Instruct + IF-RLVR | 48.7 | 89.1 |
| 6 | OLMo2 Base + IF-RLVR | 47.3 | 70.4 |
| 7 | OLMo2 Instruct + IF-RLVR | 44.7 | 74.5 |
| 8 | Tulu3 DPO + IF-RLVR | 43.3 | 92.2 |
| 9 | Claude 4 Sonnet | 42.3 | 91.3 |
| 10 | DeepSeek R1 | 38.0 | 86.1 |
| 11 | Qwen 3 32B | 37.3 | 85.6 |
| 12 | Qwen 3 8B | 35.0 | 86.3 |
Sorted by IFBench score (higher is better). If you want your model added to the leaderboard, please create a pull request or email me!
This codebase is licensed under Apache 2.0 as given in LICENSE.
The data is licensed under ODC-BY-1.0. It is intended for research and educational use in accordance with Ai2's Responsible Use Guidelines. The dataset includes output data generated from third-party models that are subject to separate terms governing their use.
Parts of IFBench are built upon and extend IFEval (Zhou et al. 2023) and we would like to thank them for their great work!
If you use this repository or our models, please cite our work:

```bibtex
@article{pyatkin2025generalizing,
  title={Generalizing Verifiable Instruction Following},
  author={Valentina Pyatkin and Saumya Malik and Victoria Graf and Hamish Ivison and Shengyi Huang and Pradeep Dasigi and Nathan Lambert and Hannaneh Hajishirzi},
  journal={Advances in Neural Information Processing Systems},
  volume={38},
  year={2025}
}
```