|🤗 Dataset | [🏆Leaderboard(SOON!)] | 📖 Original Dataset Paper |
This repository contains the evaluation code for the Turkish-translated version of the paper "MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark"
We introduce MMLU-Pro-TR, a benchmark that is the translated version of MMLU-Pro, an enhanced benchmark designed to evaluate language understanding models across broader and more challenging tasks. Building on the Massive Multitask Language Understanding (MMLU) dataset, MMLU-Pro integrates more challenging, reasoning-focused questions and increases the answer choices per question from four to ten, significantly raising the difficulty and reducing the chance of success through random guessing. MMLU-Pro comprises over 12,000 rigorously curated questions from academic exams and textbooks, spanning 14 diverse domains including Biology, Business, Chemistry, Computer Science, Economics, Engineering, Health, History, Law, Math, Philosophy, Physics, Psychology, and Others.
Please refer to huggingface 🤗 Dataset and 🤗 Dataset-TR for more details.
To run local inference, modify the model name in the following script and execute it:
sh run.sh| Models | Original MMLU Score | MMLU Pro Score | Drop |
|---|---|---|---|
| Metin/LLaMA-3-8B-Instruct-TR-DPO | 49.71 | 27.00 | 22.71 |
| ytu-ce-cosmos/Turkish-Llama-8b-Instruct-v0.1 | 51.75 | 23.90 | 27.85 |
| VeriUS/VeriUS-LLM-8b-v0.2 | 48.81 | 23.23 | 25.58 |
| Orbina/Orbita-v0.1 | 49.51 | 22.95 | 26.56 |
| KOCDIGITAL/Kocdigital-LLM-8b-v0.1 | 47.35 | 21.83 | 25.52 |
| meta-llama/Meta-Llama-3-8B-Instruct | 49.29 | 20.93 | 28.36 |
| NousResearch/Meta-Llama-3-8B | 49.29 | 20.93 | 28.36 |
| curiositytech/MARS | 46.73 | 20.81 | 25.92 |
| Trendyol/Trendyol-LLM-7b-chat-v1.8 | 41.91 | 18.15 | 23.76 |
| TURKCELL/Turkcell-LLM-7b-v1 | 39.03 | 17.15 | 21.88 |
| ytu-ce-cosmos/turkish-gpt2-large-750m-instruct-v0.1 | 26.56 | 10.88 | 15.67 |
For more details on various models and their accuracy across different subjects, please visit our [Leaderboard(SOON!)].
BibTeX:
@misc{wang2024mmlupro,
title={MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark},
author={Yubo Wang and Xueguang Ma and Ge Zhang and Yuansheng Ni and Abhranil Chandra and Shiguang Guo and Weiming Ren and Aaran Arulraj and Xuan He and Ziyan Jiang and Tianle Li and Max Ku and Kai Wang and Alex Zhuang and Rongqi Fan and Xiang Yue and Wenhu Chen},
year={2024},
eprint={2406.01574},
archivePrefix={arXiv},
primaryClass={cs.CL}
}