
maitrix-org/de-arena



Blog | LeaderBoard | Twitter | Discord | @Maitrix.org | @LLM360


Decentralized Arena is a fully automated framework leveraging collective intelligence from all LLMs to evaluate each other. It provides:

  • Robust, unbiased: Decentralization avoids the bias of relying on a single judge model or a small committee of judges, and is less gameable by overfitting to the judge models. The more LLMs that participate in the arena, the more robust the evaluation becomes! Moreover, Decentralized Arena achieves a very high correlation (97%) with Chatbot Arena on 60+ models.

  • Automatic, easily scalable, and customizable to any evaluation dimension: While Chatbot Arena is limited to evaluating a few dimensions due to the limited number of meaningful user votes it can gather, Decentralized Arena is fully automatic and can scale to an unlimited number of dimensions. We also provide guidelines for automatically selecting dimension-specific queries for customized evaluation.

  • Fast, instant ranking of new models: Thanks to the automation and the efficient binary-search ranking algorithm, we can instantly get the evaluation results of a new model, without needing to wait for weeks to gather user votes.

  • Transparent, fully reproducible: All algorithms, implementations, and inputs/outputs will be made open, making the results fully reproducible.

  • Trustworthy: Ultimately, with its robustness, strong alignment with existing human evaluation results, fine-grained dimensional analysis, and transparency, Decentralized Arena aims to provide a benchmark the community can trust.

News

  • Dec. 31, 2024: We added style control.
  • Oct. 10, 2024: We presented De-Arena (Blog & LeaderBoard), a fully automated framework leveraging collective intelligence from all LLMs to evaluate each other.

Introduction to the library

De-Arena Structure

In De-Arena, we focus on the idea of decentralization, using all LLMs as judges to vote on pairs of other models, based on high-quality questions for each dimension. This reduces the cost of gathering human annotations, and also avoids the bias that may arise from relying on a single or a small number of judge models. To achieve this, we devise a coarse-to-fine incremental sort algorithm to efficiently rank a large number of LLMs, and an automatic question selection algorithm to select representative data for building new evaluation dimensions.
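
To make the ranking idea concrete, here is a minimal sketch of binary-search insertion with collective pairwise voting. It is illustrative only, not the repository's implementation: the judge_vote helper, the toy score table, the best-first list ordering, and the anchor seeding are all assumptions.

    # Illustrative sketch of the binary-search insertion idea behind the
    # coarse-to-fine incremental sort (not the actual De-Arena code).
    # judge_vote is a stand-in: in De-Arena it would ask every judge LLM to
    # compare the two models' answers and return the majority winner; here
    # a toy score table makes the sketch runnable end to end.
    TOY_SCORES = {"anchor_a": 90, "anchor_b": 70, "anchor_c": 50,
                  "model_x": 80, "model_y": 60}

    def judge_vote(model_a, model_b, judges):
        # Majority vote of the judges; the toy judges all agree here.
        return model_a if TOY_SCORES[model_a] >= TOY_SCORES[model_b] else model_b

    def insert_model(ranked, new_model, judges):
        # ranked is ordered best-to-worst. Each comparison is one collective
        # vote, so inserting into n models costs O(log n) votes, not O(n).
        lo, hi = 0, len(ranked)
        while lo < hi:
            mid = (lo + hi) // 2
            if judge_vote(new_model, ranked[mid], judges) == new_model:
                hi = mid        # new model wins: it belongs above ranked[mid]
            else:
                lo = mid + 1    # new model loses: it belongs below ranked[mid]
        return ranked[:lo] + [new_model] + ranked[lo:]

    # Coarse-to-fine: seed the ranking with a few anchor (base) models,
    # then insert the remaining models one at a time.
    ranking = ["anchor_a", "anchor_b", "anchor_c"]
    for model in ["model_x", "model_y"]:
        ranking = insert_model(ranking, model, judges=[])
    print(ranking)  # ['anchor_a', 'model_x', 'anchor_b', 'model_y', 'anchor_c']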

Pipeline

Before starting, remember to change the model paths in the file utils_final.py.
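
The edit typically amounts to pointing each model name at a local checkpoint or Hugging Face identifier. The snippet below is a hypothetical illustration; the actual variable name and structure in utils_final.py may differ.

    # Hypothetical illustration of the kind of mapping to edit in
    # utils_final.py; the actual variable name and structure may differ.
    model_path = {
        "llama-3-8b-instruct": "/data/models/Meta-Llama-3-8B-Instruct",
        "qwen2-7b-instruct": "Qwen/Qwen2-7B-Instruct",  # or a local path
    }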

API Example

This example provides the ranking of nine OpenAI API models. The results from the first step have already been stored in "mt_bench_responses", and some results from the second step have also been stored in "judgements_mt_bench".

    1. bash response.bash

    Parameter List:

    • model_name: A list representing the models you want to rank.
    • output_dir: The path where the models' responses are saved.
    • path: The path to the question set.
    • openai_api: Your OpenAI API key.
    2. bash ranking_api_example.bash

    Parameter List:

    • openai_api: Your OpenAI API key.
    • overall_ids: The question IDs to evaluate on.
    • save_output_file_path: The path where your judge pairs are saved.
    • judge_open_model: Your open source judge model list.
    • judge_api_model: Your API judge model list.
    • base_model_list: Your base model list.
    • sort_model_list: A list representing the models you want to rank.
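
Once both steps have run, the cached pairwise judgements can be inspected directly. The snippet below is a hypothetical sketch of aggregating such verdicts into per-model win rates; it assumes each file in "judgements_mt_bench" holds a JSON list of records with model_a, model_b, and winner fields, which may differ from the repository's actual layout.

    # Hypothetical aggregation of cached pairwise verdicts into win rates.
    # Assumes each file in judgements_mt_bench holds a JSON list of records
    # like {"model_a": ..., "model_b": ..., "winner": ...}; the repository's
    # actual file layout may differ.
    import json
    from collections import defaultdict
    from pathlib import Path

    wins, games = defaultdict(int), defaultdict(int)
    for path in Path("judgements_mt_bench").glob("*.json"):
        for rec in json.loads(path.read_text()):
            a, b, w = rec["model_a"], rec["model_b"], rec["winner"]
            games[a] += 1
            games[b] += 1
            if w in (a, b):     # ties, if any, count as no win
                wins[w] += 1

    for model, rate in sorted(((m, wins[m] / games[m]) for m in games),
                              key=lambda kv: -kv[1]):
        print(f"{model}: {rate:.3f}")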

Open Source Example

This example provides the ranking of open-source models. The results from the first step have already been stored in "mt_bench_responses", and results from the second step have also been stored in "judgements_mt_bench".

    1. bash response.bash

    Parameter List:

    • model_name: A list representing the models you want to rank.
    • output_dir: The path where the models' responses are saved.
    • path: The path to the question set.
    • openai_api: Your OpenAI API key.
    2. bash judge.bash

    In this process, to accelerate the evaluation, we first obtain the full-sample judge results for each open-source model by using this script. The duration of this process depends on the number of GPUs available. Make sure to modify the model_name parameter and run this script once for every open-source model you want to use as a judge; one way to parallelize the runs across GPUs is sketched after the parameter list below.

    Parameter List:

    • model_name: The judge model that produces the judgements.
    • model_names: A list representing the models you want to rank.
    • path: The path to the question set.
    • q_set: The dimension you are running.
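
    Since judge.bash must be run once per open-source judge, one simple way to use several GPUs is to launch one run per GPU. The sketch below is hypothetical orchestration, not part of the repository; in particular, it assumes judge.bash accepts the judge model as an argument, whereas the instructions above say to edit model_name inside the script.

    # Hypothetical orchestration (not part of the repository): launch one
    # judge.bash run per GPU, one per open-source judge model.
    import os
    import subprocess

    judge_models = ["llama-3-8b-instruct", "qwen2-7b-instruct"]  # example names

    procs = []
    for gpu, judge in enumerate(judge_models):
        env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu))
        # Assumes judge.bash reads the judge model from its first argument;
        # by default you would instead edit model_name inside the script.
        procs.append(subprocess.Popen(["bash", "judge.bash", judge], env=env))

    for p in procs:
        p.wait()  # block until every judge has finished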
    3. bash ranking_open_example.bash

    Parameter List:

    • openai_api: Your OpenAI API key (not needed in this example).
    • overall_ids: The question IDs to evaluate on.
    • save_output_file_path: The path where your judge pairs are saved.
    • judge_open_model: Your open source judge model list.
    • judge_api_model: Your API judge model list.
    • base_model_list: Your base model list.
    • sort_model_list: A list representing the models you want to rank.

Installation

Make sure to use Python 3.10 or later.

conda create -n reasoners python=3.10
conda activate reasoners

Install from GitHub

git clone https://github.com/Yanbin-Yin/De-Arena
cd De-Arena
pip install -r requirements.txt

Citation

@misc{decentralized2024,
    title        = {Decentralized Arena via Collective LLM Intelligence: Building Automated, Robust, and Transparent LLM Evaluation for Numerous Dimensions},
    author       = {Yanbin Yin and Zhen Wang and Kun Zhou and Xiangdong Zhang and Shibo Hao and Yi Gu and Jieyuan Liu and Somanshu Singla and Tianyang Liu and Eric P. Xing and Zhengzhong Liu and Haojian Jin and Zhiting Hu},
    year         = 2024,
    month        = 10,
    url          = {https://de-arena.maitrix.org/}
}

About

Official repository for Decentralized Arena via Collective LLM Intelligence
