Evaluation is crucial to the development of information retrieval models. In recent years, a series of milestone benchmarks have been introduced to the community, such as MSMARCO, Natural Questions (open-domain QA), MIRACL (multilingual retrieval), and BEIR and MTEB (general-domain zero-shot retrieval). However, the existing benchmarks are severely limited in the following respects.
- Incapability of dealing with new domains. All of the existing benchmarks are static: they are built for pre-defined domains based on human-labeled data. Therefore, they cannot cover new domains that users are interested in.
- Potential risk of over-fitting and data leakage. Existing retrievers are intensively fine-tuned to achieve strong performance on popular benchmarks such as BEIR and MTEB. Although these benchmarks were originally designed for zero-shot, out-of-domain evaluation, in-domain training data is widely used during fine-tuning. Worse still, given the public availability of the existing evaluation datasets, the testing data can easily be mixed into a retriever's training set by mistake.
To address these problems, AIR-Bench is designed with the following features:

- 🤖 Automated. The testing data is automatically generated by large language models without human intervention. As a result, AIR-Bench can instantly support the evaluation of new domains at very low cost. Moreover, the newly generated testing data is highly unlikely to be covered by the training sets of any existing retrievers.
- 🔍 Retrieval and RAG-oriented. The new benchmark is dedicated to the evaluation of retrieval performance. In addition to typical evaluation scenarios, such as open-domain question answering and paraphrase retrieval, it also incorporates a new setting called inner-document retrieval, which is closely related to today's LLM and RAG applications. In this setting, the model is expected to retrieve the relevant chunks of a very long document, which contain the critical information needed to answer the input question (a minimal sketch of this setting follows this list).
- 🔄 Heterogeneous and Dynamic. The testing data is generated for diverse and continually expanding domains and languages (i.e., multi-domain and multi-lingual). As a result, AIR-Bench provides an increasingly comprehensive evaluation benchmark for the community.
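To make the inner-document retrieval setting concrete, here is a minimal, self-contained sketch of chunking a long document and ranking its chunks against a question. This is not AIR-Bench code; the bag-of-words scorer is only a toy stand-in for a real embedding model or retriever.

```python
# Toy sketch of inner-document retrieval: split one long document into chunks,
# then rank the chunks by similarity to the question. The bag-of-words cosine
# score below is a stand-in for a real retriever.
import math
from collections import Counter


def chunk(document: str, size: int = 50) -> list[str]:
    """Split a document into fixed-size word chunks (real systems use smarter chunking)."""
    words = document.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def bow(text: str) -> Counter:
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve_chunks(question: str, document: str, top_k: int = 3) -> list[tuple[float, str]]:
    """Return the top-k chunks of the document, ranked by similarity to the question."""
    q = bow(question)
    scored = [(cosine(q, bow(c)), c) for c in chunk(document)]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:top_k]


if __name__ == "__main__":
    long_doc = " ".join(
        ["filler text about unrelated topics"] * 30
        + ["the answer to the question is hidden in this sentence"]
        + ["more filler text"] * 30
    )
    for score, text in retrieve_chunks("where is the answer hidden", long_doc, top_k=2):
        print(f"{score:.2f}  {text[:60]}...")
```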
We plan to release new test datasets on a regular basis. The latest version is AIR-Bench_24.05.
Version | Release Date | # of domains | # of languages | # of datasets | Details |
---|---|---|---|---|---|
AIR-Bench_24.05 | Oct 17, 2024 | 9 [1] | 13 [2] | 69 | here |
AIR-Bench_24.04 | May 21, 2024 | 8 [3] | 2 [4] | 28 | here |
[1] wiki, web, news, healthcare, law, finance, arxiv, book, science.
[2] en, zh, es, fr, de, ru, ja, ko, ar, fa, id, hi, bn (English, Chinese, Spanish, French, German, Russian, Japanese, Korean, Arabic, Persian, Indonesian, Hindi, Bengali).
[3] wiki, web, news, healthcare, law, finance, arxiv, book.
[4] en, zh (English, Chinese).
For the differences between versions, please refer to here.
You can check out the results on the AIR-Bench Leaderboard. Detailed results are available in eval_results.
Some brief analysis results are available here. The technical report is coming soon. Please stay tuned for updates!
This repo maintains the codebase for running AIR-Bench evaluations. To run an evaluation, install air-benchmark:

```bash
pip install air-benchmark
```
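Optionally, you can confirm the installation from Python. This is just a quick sanity check using the standard library; it assumes nothing beyond the pip command above.

```python
# Optional sanity check: print the installed version of the air-benchmark distribution.
from importlib.metadata import version

print(version("air-benchmark"))
```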
Follow the steps below to run evaluations and submit the results to the leaderboard (see here for more detailed information).
- Run evaluations
  - See the scripts to run evaluations on AIR-Bench for your models.
- Submit search results (only for the test set)
  - Package the output files.
    - For results produced without a reranking model:

      ```bash
      cd scripts
      python zip_results.py \
          --results_dir search_results \
          --retriever_name [YOUR_RETRIEVAL_MODEL] \
          --save_dir search_results
      ```

    - For results produced with a reranking model:

      ```bash
      cd scripts
      python zip_results.py \
          --results_dir search_results \
          --retriever_name [YOUR_RETRIEVAL_MODEL] \
          --reranker_name [YOUR_RERANKING_MODEL] \
          --save_dir search_results
      ```

  - Upload the output `.zip` and fill in the model information at the AIR-Bench Leaderboard.
Documentation | Description |
---|---|
🏭 Pipeline | The data generation pipeline of AIR-Bench |
📋 Tasks | Overview of available tasks in AIR-Bench |
📈 Leaderboard | The interactive leaderboard of AIR-Bench |
🚀 Submit | Information related to how to submit a model to AIR-Bench |
🤝 Contributing | How to contribute to AIR-Bench |
This work is inspired by MTEB and BEIR. Many thanks for the early feedback from @tomaarsen, @Muennighoff, @takatost, and @chtlp.