Evaluation is crucial to the development of information retrieval models. In recent years, a series of milestone benchmarks have been introduced to the community, such as MSMARCO, Natural Questions (open-domain QA), MIRACL (multilingual retrieval), and BEIR and MTEB (general-domain zero-shot retrieval). However, the existing benchmarks are severely limited in the following respects.
- Incapability of dealing with new domains. All of the existing benchmarks are static: they are built for pre-defined domains based on human-labeled data. Therefore, they cannot cover new domains that users are interested in.
- Potential risk of over-fitting and data leakage. Existing retrievers are intensively fine-tuned to achieve strong performance on popular benchmarks such as BEIR and MTEB. Although these benchmarks were originally designed for zero-shot, out-of-domain evaluation, in-domain training data is widely used during fine-tuning. Worse still, given the public availability of the existing evaluation datasets, the testing data can easily be mixed into a retriever's training set by mistake.
To address these problems, AIR-Bench is designed with the following features:

- 🤖 Automated. The testing data is automatically generated by large language models without human intervention. As a result, AIR-Bench can instantly support the evaluation of new domains at very low cost. Moreover, the newly generated testing data is highly unlikely to be covered by the training sets of any existing retrievers.
- 🔍 Retrieval and RAG-oriented. The new benchmark is dedicated to the evaluation of retrieval performance. In addition to typical evaluation scenarios, such as open-domain question answering and paraphrase retrieval, it also incorporates a new setting called inner-document retrieval, which is closely related to today's LLM and RAG applications. In this setting, the model is expected to retrieve the relevant chunks of a very long document, which contain the critical information needed to answer the input question (a minimal sketch of this setting follows this list).
- 🔄 Heterogeneous and Dynamic. The testing data is generated for diverse and continually expanding domains and languages (i.e., multi-domain and multi-lingual). As a result, AIR-Bench provides an increasingly comprehensive evaluation benchmark for the community.
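To make the inner-document retrieval setting concrete, here is a minimal, self-contained sketch of chunking a long document and ranking its chunks against a question. This is not AIR-Bench code; the bag-of-words scorer is only a toy stand-in for a real embedding model or retriever.

```python
# Toy sketch of inner-document retrieval: split one long document into chunks,
# then rank the chunks by similarity to the question. The bag-of-words cosine
# score below is a stand-in for a real retriever.
import math
from collections import Counter


def chunk(document: str, size: int = 50) -> list[str]:
    """Split a document into fixed-size word chunks (real systems use smarter chunking)."""
    words = document.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def bow(text: str) -> Counter:
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve_chunks(question: str, document: str, top_k: int = 3) -> list[tuple[float, str]]:
    """Return the top-k chunks of the document, ranked by similarity to the question."""
    q = bow(question)
    scored = [(cosine(q, bow(c)), c) for c in chunk(document)]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:top_k]


if __name__ == "__main__":
    long_doc = " ".join(
        ["filler text about unrelated topics"] * 30
        + ["the answer to the question is hidden in this sentence"]
        + ["more filler text"] * 30
    )
    for score, text in retrieve_chunks("where is the answer hidden", long_doc, top_k=2):
        print(f"{score:.2f}  {text[:60]}...")
```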
We plan to release new test datasets on a regular basis. The latest version is AIR-Bench_24.05.
Version | Release Date | # of domains | # of languages | # of datasets | Details |
---|---|---|---|---|---|
AIR-Bench_24.05 | Oct 17, 2024 | 9 [1] | 13 [2] | 69 | here |
AIR-Bench_24.04 | May 21, 2024 | 8 [3] | 2 [4] | 28 | here |
[1] wiki, web, news, healthcare, law, finance, arxiv, book, science.
[2] en, zh, es, fr, de, ru, ja, ko, ar, fa, id, hi, bn (English, Chinese, Spanish, French, German, Russian, Japanese, Korean, Arabic, Persian, Indonesian, Hindi, Bengali).
[3] wiki, web, news, healthcare, law, finance, arxiv, book.
[4] en, zh (English, Chinese).
For the differences between versions, please refer to here.
You can check out the results on the AIR-Bench Leaderboard. Detailed results are available in eval_results.
Some brief analysis results are available here. The technical report is coming soon. Please stay tuned for updates!
This repo maintains the codebase for running AIR-Bench evaluations. To run an evaluation, install air-benchmark:

```bash
pip install air-benchmark
```
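Optionally, you can confirm the installation from Python. This is just a quick sanity check using the standard library; it assumes nothing beyond the pip command above.

```python
# Optional sanity check: print the installed version of the air-benchmark distribution.
from importlib.metadata import version

print(version("air-benchmark"))
```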
Follow the steps below to run evaluations and submit the results to the leaderboard (see here for more detailed information).
- Run evaluations
  - See the scripts to run evaluations on AIR-Bench for your models.
- Submit search results (only for the test set)
  - Package the output files.
    - For results produced without a reranking model:

      ```bash
      cd scripts
      python zip_results.py \
          --results_dir search_results \
          --retriever_name [YOUR_RETRIEVAL_MODEL] \
          --save_dir search_results
      ```

    - For results produced with a reranking model:

      ```bash
      cd scripts
      python zip_results.py \
          --results_dir search_results \
          --retriever_name [YOUR_RETRIEVAL_MODEL] \
          --reranker_name [YOUR_RERANKING_MODEL] \
          --save_dir search_results
      ```

  - Upload the output `.zip` and fill in the model information at the AIR-Bench Leaderboard.
Documentation | Description |
---|---|
🏭 Pipeline | The data generation pipeline of AIR-Bench |
📋 Tasks | Overview of available tasks in AIR-Bench |
📈 Leaderboard | The interactive leaderboard of AIR-Bench |
🚀 Submit | Information related to how to submit a model to AIR-Bench |
🤝 Contributing | How to contribute to AIR-Bench |
This work is inspired by MTEB and BEIR. Many thanks for the early feedback from @tomaarsen, @Muennighoff, @takatost, and @chtlp.