Comprehensive benchmark suite for Turkish language models.
There is no standardized, comprehensive benchmark for Turkish LLMs; terazi fills that gap. The name is Turkish for "scales," the instrument used for weighing and measuring.
| Category | Description | Tasks |
|---|---|---|
| terazi-core | General Turkish language understanding | Reading comprehension, common sense, grammar, translation, summarization |
| terazi-tool | Tool use and function calling | API calls, multi-step chains, parameter extraction, error recovery |
| terazi-fin | Financial Turkish | Document comprehension, sentiment, numerical reasoning, terminology |
| terazi-legal | Legal Turkish | Document comprehension, case reasoning, clause extraction, regulatory compliance |
Target: 500-1000 examples per category, 2000-4000 total.
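A single benchmark item might look like the following sketch. The field names and multiple-choice layout are assumptions for illustration, not the actual terazi schema:

```python
# Hypothetical terazi-core record (schema is an assumption, not the real format).
example = {
    "id": "core-0001",
    "category": "core",
    "task": "reading_comprehension",
    # Turkish passage plus question: "Ankara is the capital of Turkey.
    # What is the capital of Turkey?"
    "prompt": "Ankara Türkiye'nin başkentidir. Türkiye'nin başkenti neresidir?",
    "choices": ["Ankara", "İstanbul", "İzmir", "Bursa"],
    "answer": 0,  # index of the correct choice
}

# An eval harness would compare a model's chosen index against "answer".
assert example["choices"][example["answer"]] == "Ankara"
```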
```shell
pip install -e .
```

Requires AWS credentials configured for Bedrock access (Claude Opus).
```shell
# Generate all categories (500 examples each)
terazi generate --category all --num-examples 500

# Generate a specific category
terazi generate --category core --num-examples 100

# Evaluate a HuggingFace model
terazi eval --model meta-llama/Llama-3.1-8B-Instruct --categories core,tool

# Evaluate an API model
terazi eval --model gpt-4 --backend api --base-url https://api.openai.com/v1

# View results
terazi results --format table
```
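Under the hood, a results view like the one above boils down to aggregating per-item correctness into per-category accuracy. A minimal sketch, assuming a hypothetical list of result records (the actual terazi results format may differ):

```python
from collections import defaultdict

def per_category_accuracy(records):
    """Aggregate per-item correctness flags into per-category accuracy."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for r in records:
        totals[r["category"]] += 1
        hits[r["category"]] += r["correct"]  # True counts as 1
    return {cat: hits[cat] / totals[cat] for cat in totals}

# Hypothetical records; real ones would come from the eval harness output.
records = [
    {"category": "core", "correct": True},
    {"category": "core", "correct": False},
    {"category": "tool", "correct": True},
]
print(per_category_accuracy(records))  # {'core': 0.5, 'tool': 1.0}
```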
```
terazi/
  generate/   Data generation pipeline (Bedrock/Opus)
  eval/       Evaluation harness (runner, metrics, formats)
  configs/    lm-evaluation-harness task configs
  scripts/    Shell scripts for generation and eval
  data/       Generated benchmark data (gitignored)
  results/    Evaluation results (gitignored)
```
- Generate benchmark data (or download from HuggingFace: selimozten/terazi)
- Run the eval harness against your model
- Submit results via PR to be added to the leaderboard
MIT