Evaluating the Generative Capabilities of LLMs in Russian.
Benchmark and a family of LM-as-a-Judge models.
Welcome to POLLUX, an open-source project dedicated to evaluating the generative capabilities of modern large language models (LLMs) in Russian.
Our comprehensive evaluation framework is built on three foundational pillars. First, we provide carefully developed taxonomies that systematically categorize both generative tasks and evaluation criteria. Second, our meticulously crafted benchmark comprises 2,100 unique, manually created instructions paired with 471,515 detailed criterion-level assessments. Finally, POLLUX features a specialized family of LLM-based judges that automate the evaluation process, enabling scalable and systematic assessment of model outputs across all task categories.
- Explore the benchmark on the project page.
- See the Hugging Face collection for the dataset and the models.
- 152 diverse tasks: Covering open-ended generation, text-to-text transformation, information-seeking, and code-related prompts. The task taxonomy is grounded in an analysis of real-world user queries.
- 66 unique evaluation criteria: A rich set of non-overlapping, fine-grained metrics, ranging from surface-level quality (e.g., absence of artifacts) to higher-level abilities such as reasoning and creativity. Each criterion comes with a clearly defined evaluation scale.
- Three difficulty levels: Tasks are organized into easy, medium, and hard tiers to support targeted model diagnostics.
- Expert-curated tasks: All tasks and criteria are designed from scratch by domain experts to ensure quality and relevance. All instructions and criteria annotations are likewise developed and reviewed by expert panels to maintain consistent standards throughout the evaluation process.
- LLM-based evaluators: A suite of judge models (7B and 32B) trained to assess responses against specific criteria and generate score justifications. Supports custom criteria and evaluation scales via flexible input formatting (beta).
Score model outputs with POLLUX judges: demo.ipynb
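If you prefer a plain script over the notebook, the sketch below shows one way to query the 7B judge directly with Hugging Face Transformers. Treat it as a minimal sketch: the prompt layout here is a placeholder, and real evaluations should follow the prompt template in src/data_utils/test_prompt_template_ru.yaml, as demo.ipynb does.

```python
# Minimal sketch: score a single response with the 7B judge via Transformers.
# The prompt below is an illustrative placeholder; real evaluations must follow
# the template in src/data_utils/test_prompt_template_ru.yaml (see demo.ipynb).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ai-forever/pollux-judge-7b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype="auto")

# Placeholder fields: instruction, model answer, criterion and its scale.
prompt = (
    "Instruction: <user instruction>\n"
    "Model answer: <answer to be judged>\n"
    "Criterion: <criterion name and description>\n"
    "Scale: <description of the score scale>\n"
    "Rate the answer on the criterion and justify the score."
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=256)
# The judge is trained to return a score on the given scale plus a textual justification.
print(tokenizer.decode(output_ids[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```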
To reproduce the evaluation results, please refer to the src/inference.py file:
```bash
git clone https://github.com/ai-forever/POLLUX.git
cd POLLUX
pip install -r requirements.txt

python ./src/inference.py --test_path ai-forever/POLLUX \
    --template_path src/data_utils/test_prompt_template_ru.yaml --num_proc 1 \
    inference_offline_vllm --model_path ai-forever/pollux-judge-7b \
    --tokenizer_path ai-forever/pollux-judge-7b --tensor_parallel_size 1 \
    --answer_path pollux_judge_7b.json

python ./src/inference.py --test_path ai-forever/POLLUX \
    --template_path src/data_utils/test_prompt_template_ru.yaml --num_proc 1 \
    compute_metrics --answer_path logs/pollux_judge_7b.json
```
```
pollux/
├── images/                 # project logo
├── metainfo/               # benchmark metadata
├── clustering_demo.ipynb   # user logs analysis
├── src/                    # inference tools
│   └── inference.py        # reproduce evaluation
├── LICENSE                 # license
└── demo.ipynb              # inference demo
```
The POLLUX benchmark is built upon comprehensive taxonomies of generative tasks and evaluation criteria. Our taxonomy of generative tasks encompasses 35 general task groups organized across two hierarchical levels (functional styles/substyles and genres), covering a total of 152 distinct tasks.
Our taxonomy of evaluation criteria features five comprehensive categories that assess:
- General & Critical: Core syntactic, lexical, and semantic text properties
- Domain-specific: Properties tied to specialized functional styles
- Task-specific: Task-oriented markers and requirements
- Subjective: Human preferences and subjective opinions
Benchmark Scale & Coverage
The benchmark contains 2,100 unique instructions evenly distributed across all 35 task groups, with three complexity levels per group. Each instruction includes responses from 7 top-tier LLMs:
- OpenAI o1 & GPT-4o
- Claude 3.5 Sonnet
- Llama 405B
- T-pro-it-1.0
- YandexGPT 4 Pro
- GigaChat Max
This results in 11,500 total responses across the benchmark!
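For a quick look at the data, the sketch below loads the benchmark from the Hugging Face Hub. The split and column names are not assumed here; inspect the returned object (or the dataset card of ai-forever/POLLUX) for the actual schema.

```python
# Minimal sketch: download and inspect the POLLUX benchmark from the Hugging Face Hub.
# Split and column names are not assumed here; print them to see the actual schema.
from datasets import load_dataset

pollux = load_dataset("ai-forever/POLLUX")
print(pollux)                              # available splits and their sizes

first_split = next(iter(pollux))           # name of the first split
print(pollux[first_split].column_names)    # e.g. instruction text, task group, criteria, scores
print(pollux[first_split][0])              # one fully annotated example
```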
Expert Evaluation Process
Every response is rigorously evaluated using a tailored criteria set combining:
- Critical, Subjective, and General criteria
- Relevant Domain- and Task-specific criteria
With at least two expert evaluators per criterion, we've collected:
- 471,000+ individual criteria estimates with textual rationales
- 161,076 aggregated numerical scores (after merging the annotator overlap)
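To make the relationship between the two numbers concrete, the toy sketch below averages overlapping expert estimates into one aggregate score per (instruction, model, criterion) triple. The field names and the plain mean are illustrative assumptions, not the released annotation format or the exact aggregation rule.

```python
# Toy sketch: collapse overlapping expert estimates into one aggregate score per
# (instruction, model, criterion) triple. Field names and mean-aggregation are
# illustrative assumptions, not the project's released format.
from collections import defaultdict
from statistics import mean

raw_estimates = [
    {"instruction_id": 1, "model": "model-a", "criterion": "spelling", "score": 2},
    {"instruction_id": 1, "model": "model-a", "criterion": "spelling", "score": 1},  # second expert
]

by_item = defaultdict(list)
for est in raw_estimates:
    by_item[(est["instruction_id"], est["model"], est["criterion"])].append(est["score"])

aggregate_scores = {key: mean(scores) for key, scores in by_item.items()}
print(aggregate_scores)  # {(1, 'model-a', 'spelling'): 1.5}
```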
Access & Exploration
Ready to dive in? Access the benchmark on its home page and explore the data through our interactive demo!
POLLUX includes a family of LLM-based judges, trained to evaluate model outputs against scale-based criteria. The judges are designed to be flexible and can be adapted to different evaluation scales and criteria.
We provide two versions of the judges:
- 7B (T-lite-based): A smaller model that is faster and more efficient, suitable for quick evaluations and lower resource environments.
- 32B (T-pro-based): A larger model that provides more accurate evaluations, suitable for high-performance environments.
There are two architecture types in both sizes:
- seq2seq: A generative judge that produces the score and its textual justification jointly as a single decoder-only text output.
- regression (-r in HF model identifiers): A judge that predicts the numeric score with an additional regression head and generates the textual justification in the same decoder-only manner.
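For batch scoring, the seq2seq judges can be served with vLLM offline inference, which is what the inference_offline_vllm mode of src/inference.py builds on. The sketch below uses placeholder prompts; real prompts must be built from src/data_utils/test_prompt_template_ru.yaml, and the regression (-r) variants expose an extra head that this plain-generation sketch does not cover.

```python
# Minimal sketch: batch-score responses with a seq2seq POLLUX judge via vLLM
# offline inference. Prompts are placeholders and must follow the repository's
# YAML prompt template in practice.
from vllm import LLM, SamplingParams

judge = LLM(model="ai-forever/pollux-judge-7b", tensor_parallel_size=1)
params = SamplingParams(temperature=0.0, max_tokens=256)  # deterministic judging

prompts = [
    "<instruction, model answer, criterion and scale formatted per the template>",
    "<another formatted evaluation request>",
]
for request_output in judge.generate(prompts, params):
    print(request_output.outputs[0].text)  # score on the criterion's scale plus a justification
```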
This project is licensed under the MIT License. See LICENSE for details.
If you use POLLUX in your research, please cite the following paper:
```bibtex
@misc{martynov2025eyejudgementdissectingevaluation,
      title={Eye of Judgement: Dissecting the Evaluation of Russian-speaking LLMs with POLLUX},
      author={Nikita Martynov and Anastasia Mordasheva and Dmitriy Gorbetskiy and Danil Astafurov and Ulyana Isaeva and Elina Basyrova and Sergey Skachkov and Victoria Berestova and Nikolay Ivanov and Valeriia Zanina and Alena Fenogenova},
      year={2025},
      eprint={2505.24616},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2505.24616}
}
```
Made with ❤️ by the POLLUX team