MMLU-Pro-TR

|🤗 Dataset | [🏆Leaderboard(SOON!)] | 📖 Original Dataset Paper |

This repository contains the evaluation code for MMLU-Pro-TR, the Turkish translation of the benchmark introduced in the paper "MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark".

Introduction

We introduce MMLU-Pro-TR, the Turkish translation of MMLU-Pro, an enhanced benchmark designed to evaluate language understanding models across broader and more challenging tasks. Building on the Massive Multitask Language Understanding (MMLU) dataset, MMLU-Pro adds more challenging, reasoning-focused questions and increases the number of answer choices per question from four to ten, which raises the difficulty and lowers the expected accuracy of random guessing from 25% to 10%. MMLU-Pro comprises over 12,000 rigorously curated questions from academic exams and textbooks, spanning 14 domains: Biology, Business, Chemistry, Computer Science, Economics, Engineering, Health, History, Law, Math, Philosophy, Physics, Psychology, and Others.
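The effect of moving from four to ten answer choices on the random-guessing baseline can be checked with a few lines of arithmetic. This is a minimal sketch that assumes every question has exactly the stated number of options and that a guesser picks uniformly at random; the function name `random_guess_accuracy` is illustrative and not part of this repository.

```python
def random_guess_accuracy(num_choices: int) -> float:
    """Expected accuracy (in percent) of uniform random guessing.

    Assumes exactly one correct option out of `num_choices`.
    """
    if num_choices < 1:
        raise ValueError("num_choices must be at least 1")
    return 100.0 / num_choices


# Four options (MMLU) versus ten options (MMLU-Pro).
mmlu_baseline = random_guess_accuracy(4)
mmlu_pro_baseline = random_guess_accuracy(10)

print(f"MMLU random baseline:     {mmlu_baseline:.1f}%")
print(f"MMLU-Pro random baseline: {mmlu_pro_baseline:.1f}%")
```

Scores close to these baselines suggest a model is doing little better than guessing on the corresponding benchmark.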

Dataset Creation

Please refer to the Hugging Face 🤗 Dataset and 🤗 Dataset-TR pages for more details.

Evaluation

To run local inference, modify the model name in the following script and execute it:

sh run.sh
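Once inference has produced predictions, accuracy is typically reported per subject and overall. The sketch below illustrates that aggregation under an assumed record format: the field names `category`, `answer`, and `pred` are hypothetical, and the function is not the repository's actual scoring code.

```python
from collections import defaultdict


def accuracy_by_category(records):
    """Return (per-category accuracy in percent, overall accuracy in percent).

    Each record is assumed (hypothetically) to be a dict with:
      "category": subject name,
      "answer":   gold option letter,
      "pred":     predicted option letter, or None if no answer was extracted.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        cat = rec["category"]
        total[cat] += 1
        # A missing prediction counts as wrong.
        if rec["pred"] is not None and rec["pred"] == rec["answer"]:
            correct[cat] += 1

    per_category = {cat: 100.0 * correct[cat] / total[cat] for cat in total}
    n_total = sum(total.values())
    overall = 100.0 * sum(correct.values()) / n_total if n_total else 0.0
    return per_category, overall


# Tiny made-up sample, for illustration only.
sample = [
    {"category": "math", "answer": "C", "pred": "C"},
    {"category": "math", "answer": "A", "pred": "J"},
    {"category": "law", "answer": "B", "pred": None},
    {"category": "law", "answer": "D", "pred": "D"},
]

per_category, overall = accuracy_by_category(sample)
print(per_category, overall)
```

Note that the overall figure here is a micro-average over questions, so subjects with more questions weigh more heavily than they would in an unweighted mean of per-subject accuracies.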

🏆 Mini-Leaderboard

| Models | Original MMLU Score | MMLU Pro Score | Drop |
|---|---|---|---|
| Metin/LLaMA-3-8B-Instruct-TR-DPO | 49.71 | 27.00 | 22.71 |
| ytu-ce-cosmos/Turkish-Llama-8b-Instruct-v0.1 | 51.75 | 23.90 | 27.85 |
| VeriUS/VeriUS-LLM-8b-v0.2 | 48.81 | 23.23 | 25.58 |
| Orbina/Orbita-v0.1 | 49.51 | 22.95 | 26.56 |
| KOCDIGITAL/Kocdigital-LLM-8b-v0.1 | 47.35 | 21.83 | 25.52 |
| meta-llama/Meta-Llama-3-8B-Instruct | 49.29 | 20.93 | 28.36 |
| NousResearch/Meta-Llama-3-8B | 49.29 | 20.93 | 28.36 |
| curiositytech/MARS | 46.73 | 20.81 | 25.92 |
| Trendyol/Trendyol-LLM-7b-chat-v1.8 | 41.91 | 18.15 | 23.76 |
| TURKCELL/Turkcell-LLM-7b-v1 | 39.03 | 17.15 | 21.88 |
| ytu-ce-cosmos/turkish-gpt2-large-750m-instruct-v0.1 | 26.56 | 10.88 | 15.67 |
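The Drop column is the difference between the two score columns, in percentage points. As a quick sanity check, the sketch below recomputes it for three rows copied from the table; the helper name `score_drop` is illustrative, and rounding to two decimals is an assumption about how the column was produced.

```python
# (Original MMLU score, MMLU Pro score) for three rows of the table above.
scores = {
    "Metin/LLaMA-3-8B-Instruct-TR-DPO": (49.71, 27.00),
    "meta-llama/Meta-Llama-3-8B-Instruct": (49.29, 20.93),
    "TURKCELL/Turkcell-LLM-7b-v1": (39.03, 17.15),
}


def score_drop(mmlu: float, mmlu_pro: float) -> float:
    """Absolute drop in percentage points, rounded to two decimals."""
    return round(mmlu - mmlu_pro, 2)


drops = {model: score_drop(*pair) for model, pair in scores.items()}

for model, drop in drops.items():
    print(f"{model}: {drop:.2f}")
```

Because the published scores are themselves rounded, a recomputed drop can differ from the listed value by 0.01 for some rows.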

For more details on various models and their accuracy across different subjects, please visit our [Leaderboard(SOON!)].

Citation

BibTeX:

@misc{wang2024mmlupro,
      title={MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark}, 
      author={Yubo Wang and Xueguang Ma and Ge Zhang and Yuansheng Ni and Abhranil Chandra and Shiguang Guo and Weiming Ren and Aaran Arulraj and Xuan He and Ziyan Jiang and Tianle Li and Max Ku and Kai Wang and Alex Zhuang and Rongqi Fan and Xiang Yue and Wenhu Chen},
      year={2024},
      eprint={2406.01574},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
