This repo contains IFBench, a new, challenging benchmark for precise instruction following. For details, see the IFBench paper, accepted to NeurIPS 2025 (Datasets & Benchmarks track).
IFBench consists of two parts, plus an optional multi-turn evaluation setting:

- OOD Constraints: 58 new and challenging constraints, each with a corresponding verification function (a toy verifier is sketched after this list). The constraint templates are combined with prompts from a held-out set of WildChat (Zhao et al. 2024).
- (optional) Multi-turn Constraint Isolation in 2 turns: the prompt and the constraint are separated over two turns, i.e. the first turn contains the user prompt and the model's response to it, and the second turn adds the constraint that modifies the initial prompt.
- New IF-RLVR training constraints: 29 new and challenging constraints, with corresponding verification functions.
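To make the verification-function idea concrete, here is a minimal sketch of what such a checker might look like. The constraint, function name, and signature are hypothetical illustrations, not the repo's actual API; the real constraints and verifiers live in the IFEvalG module linked below.

```python
import re

# Hypothetical example: a verifier for the constraint
# "every sentence must contain the word 'data'".
# Illustrative only; see the repo's verification functions for the real checks.
def verify_keyword_in_every_sentence(response: str, keyword: str = "data") -> bool:
    """Return True iff every sentence in the response contains the keyword."""
    # Naive sentence split on ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", response.strip()) if s]
    if not sentences:
        return False
    return all(keyword.lower() in s.lower() for s in sentences)
```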
Install the requirements via the requirements.txt file. You need two jsonl files: the IFBench_test.jsonl file (in the data folder) and a file with your eval prompts and completions (see sample_output.jsonl for an example). Then run:
```bash
python3 -m run_eval --input_data=IFBench_test.jsonl --input_response_data=sample_output.jsonl --output_dir=eval
```
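For orientation, here is a minimal sketch of what one line of the response file might contain. The "prompt"/"response" field names are an assumption based on the IFEval-style schema; consult sample_output.jsonl for the authoritative keys.

```python
import json

# Hypothetical single line of the response jsonl; the field names are assumed
# from the IFEval-style schema -- check sample_output.jsonl for the exact keys
# expected by run_eval.
line = {
    "prompt": "Write a short story about a robot. End your reply with the word 'done'.",
    "response": "Once upon a time, a robot learned to paint. done",
}
print(json.dumps(line))
```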
Note: In the paper we generally report the prompt-level loose accuracy on IFBench. When generating for evaluation, we use a temperature of 0 and adjust the maximum number of generated tokens depending on the model type, i.e. thinking models are allowed to generate more tokens, and we then post-process the output to extract the answer without the reasoning chains.
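As an illustration of that post-processing step, here is a minimal sketch that strips a reasoning chain before scoring, assuming the model wraps its reasoning in `<think>...</think>` tags (the delimiter varies by model and is an assumption here, not the repo's actual extraction logic):

```python
import re

def strip_reasoning(output: str) -> str:
    """Remove <think>...</think> reasoning chains, keeping only the final answer.

    Assumes the model delimits its reasoning with <think> tags; other models
    may use different markers, so adapt the pattern accordingly.
    """
    answer = re.sub(r"<think>.*?</think>", "", output, flags=re.DOTALL)
    return answer.strip()
```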
You can find our released datasets in this collection, which contains the test data, the multi-turn test data and the IF-RLVR training data.
We also release our IF-RLVR code as part of open-instruct. You can run this GRPO script using our training data. This is an example command.
The new training constraints and verification functions are here: https://github.com/allenai/open-instruct/tree/main/open_instruct/IFEvalG
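The verification functions double as reward signals during IF-RLVR training: a completion receives a binary reward depending on whether it satisfies the sampled constraint. The sketch below illustrates this idea using the hypothetical verifier from earlier; the actual reward wiring lives in the open-instruct GRPO script.

```python
from typing import Callable

# Hypothetical IF-RLVR reward: 1.0 if the completion satisfies the constraint's
# verification function, else 0.0. Illustrative only; see open-instruct for the
# real GRPO integration.
def constraint_reward(completion: str, verify_fn: Callable[[str], bool]) -> float:
    return 1.0 if verify_fn(completion) else 0.0

# Example usage with the toy verifier sketched earlier:
# reward = constraint_reward(model_output,
#                            lambda s: verify_keyword_in_every_sentence(s, "data"))
```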
| Rank | Model | IFBench Score | IFEval Score |
|---|---|---|---|
| 🥇 1 | OpenAI o3 | 69.3 | 95.0 |
| 🥈 2 | Qwen2.5 Base + IF-RLVR | 53.7 | 87.8 |
| 🥉 3 | Llama 3.1 Base + IF-RLVR | 52.7 | 88.2 |
| 4 | Gemini 2.5 Pro | 52.3 | 65.4 |
| 5 | Qwen 2.5 Instruct + IF-RLVR | 48.7 | 89.1 |
| 6 | OLMo2 Base + IF-RLVR | 47.3 | 70.4 |
| 7 | OLMo2 Instruct + IF-RLVR | 44.7 | 74.5 |
| 8 | Tulu3 DPO + IF-RLVR | 43.3 | 92.2 |
| 9 | Claude 4 Sonnet | 42.3 | 91.3 |
| 10 | DeepSeek R1 | 38.0 | 86.1 |
| 11 | Qwen 3 32B | 37.3 | 85.6 |
| 12 | Qwen 3 8B | 35.0 | 86.3 |
Sorted by IFBench score (higher is better). If you want your model added to the leaderboard, please create a pull request or email me!
This codebase is licensed under Apache 2.0 as given in LICENSE.
The data is licensed under ODC-BY-1.0. It is intended for research and educational use in accordance with Ai2's Responsible Use Guidelines. The dataset includes output data generated from third-party models that are subject to separate terms governing their use.
Parts of IFBench are built upon and extend IFEval (Zhou et al. 2023) and we would like to thank them for their great work!
If you use this repository or our models, please cite our work:

```bibtex
@article{pyatkin2025generalizing,
  title={Generalizing Verifiable Instruction Following},
  author={Valentina Pyatkin and Saumya Malik and Victoria Graf and Hamish Ivison and Shengyi Huang and Pradeep Dasigi and Nathan Lambert and Hannaneh Hajishirzi},
  journal={Advances in Neural Information Processing Systems},
  volume={38},
  year={2025}
}
```