Comprehensive benchmark suite for Turkish language models.
There is no standardized, comprehensive benchmark for Turkish LLMs; terazi fills that gap. The name is Turkish for "scales," the instrument used for weighing and measuring.
| Category | Description | Tasks |
|---|---|---|
| terazi-core | General Turkish language understanding | Reading comprehension, common sense, grammar, translation, summarization |
| terazi-tool | Tool use and function calling | API calls, multi-step chains, parameter extraction, error recovery |
| terazi-fin | Financial Turkish | Document comprehension, sentiment, numerical reasoning, terminology |
| terazi-legal | Legal Turkish | Document comprehension, case reasoning, clause extraction, regulatory compliance |
Target: 500-1000 examples per category, 2000-4000 total.
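A single benchmark item might look like the following sketch. The field names and multiple-choice layout are assumptions for illustration, not the actual terazi schema:

```python
# Hypothetical terazi-core record (schema is an assumption, not the real format).
example = {
    "id": "core-0001",
    "category": "core",
    "task": "reading_comprehension",
    # Turkish passage plus question: "Ankara is the capital of Turkey.
    # What is the capital of Turkey?"
    "prompt": "Ankara Türkiye'nin başkentidir. Türkiye'nin başkenti neresidir?",
    "choices": ["Ankara", "İstanbul", "İzmir", "Bursa"],
    "answer": 0,  # index of the correct choice
}

# An eval harness would compare a model's chosen index against "answer".
assert example["choices"][example["answer"]] == "Ankara"
```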
```shell
pip install -e .
```

Requires AWS credentials configured for Bedrock access (Claude Opus).
```shell
# Generate all categories (500 examples each)
terazi generate --category all --num-examples 500

# Generate a specific category
terazi generate --category core --num-examples 100

# Evaluate a HuggingFace model
terazi eval --model meta-llama/Llama-3.1-8B-Instruct --categories core,tool

# Evaluate an API model
terazi eval --model gpt-4 --backend api --base-url https://api.openai.com/v1

# View results
terazi results --format table
```
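Under the hood, a results view like the one above boils down to aggregating per-item correctness into per-category accuracy. A minimal sketch, assuming a hypothetical list of result records (the actual terazi results format may differ):

```python
from collections import defaultdict

def per_category_accuracy(records):
    """Aggregate per-item correctness flags into per-category accuracy."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for r in records:
        totals[r["category"]] += 1
        hits[r["category"]] += r["correct"]  # True counts as 1
    return {cat: hits[cat] / totals[cat] for cat in totals}

# Hypothetical records; real ones would come from the eval harness output.
records = [
    {"category": "core", "correct": True},
    {"category": "core", "correct": False},
    {"category": "tool", "correct": True},
]
print(per_category_accuracy(records))  # {'core': 0.5, 'tool': 1.0}
```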
```
terazi/
  generate/   Data generation pipeline (Bedrock/Opus)
  eval/       Evaluation harness (runner, metrics, formats)
  configs/    lm-evaluation-harness task configs
  scripts/    Shell scripts for generation and eval
  data/       Generated benchmark data (gitignored)
  results/    Evaluation results (gitignored)
```
- Generate benchmark data (or download from HuggingFace: selimozten/terazi)
- Run the eval harness against your model
- Submit results via PR to be added to the leaderboard
MIT