llm-eval-cli

Reproducible eval suites for any OpenAI-compatible LLM endpoint — Markdown research reports + JSON traces, in one command.

llm-eval-cli is a small, opinionated CLI that lets you describe an evaluation suite in YAML (cases × models × graders), run it against any OpenAI-compatible endpoint (OpenAI, Xiaomi MiMo, DeepSeek, Together, vLLM, llama.cpp server, Ollama, …), and get back:

a Markdown research report (reports/<name>.md) with a leaderboard, per-case breakdown, and failure appendix, and
a JSON trace (reports/<name>.json) with every prompt, response, latency, token count, and grader verdict so the run is fully reproducible.

It is built for the use case of "I want to compare model A vs model B on my tasks before shipping", not for replacing big benchmark harnesses.

Why

Most eval tooling is either heavy (lm-eval-harness, helm) or coupled to a single provider. When you're choosing between, say, MiMo, GPT-4o-mini, and DeepSeek for a real product feature, you usually want:

A handful of your prompts, not MMLU.
Mixed grading: exact match for arithmetic, JSON-Schema for structured output, regex for format, and an LLM-as-judge for free-form answers.
A diff-able artifact you can paste into a PR or share with a teammate.

This tool gives you exactly that, in ~1k lines of Python.

Install

git clone https://github.com/zakahadi/llm-eval-cli.git
cd llm-eval-cli
python3 -m venv .venv && . .venv/bin/activate
pip install -e ".[dev]"

Requires Python ≥ 3.10.

Quick start

Set the API keys for whichever providers you want to hit:

export MIMO_API_KEY=sk-...
export OPENAI_API_KEY=sk-...

Validate a suite:

llm-eval validate suites/mimo_smoke.yaml

Run it (uses mimo as the LLM judge):

llm-eval run suites/mimo_smoke.yaml --judge mimo

Open reports/mimo_smoke-<timestamp>.md.

Suite format

A suite is a single YAML file. Minimal example:

name: my_suite
providers:
  - name: mimo
    base_url: https://api.xiaomimimo.com/v1
    api_key_env: MIMO_API_KEY
models:
  - mimo-v2.5-flash
options:
  temperature: 0.0
  concurrency: 4
  repeats: 1
cases:
  - id: arithmetic
    prompt: What is 23 * 17? Reply with the integer only.
    graders:
      - type: regex
        pattern: '^\s*391\s*$'

See suites/mimo_smoke.yaml for a full example with JSON-Schema validation, instruction-following checks, and an LLM-as-judge case.

Graders

Type	What it checks
`exact`	Stripped, optionally case-insensitive equality.
`contains`	Substring match.
`regex`	Python regex search.
`json_schema`	Response parses as JSON and matches a JSON-Schema (Draft 2020).
`llm_judge`	A separate model scores against a rubric and returns JSON.

Each grader returns score ∈ [0, 1] and a reason. The case-level score is a weighted mean across graders; the case "passes" only if every grader passes.

Reports

The Markdown report contains:

a leaderboard sorted by pass-rate, average score, and p50 latency,
a per-case table with verdict, score, latency, and the first failure reason, and
a failures appendix with the actual model output for the first 10 failures (super useful for debugging prompts).

The JSON report has the full trace: prompt, system message, response, per-grader score and reason, latencies, token counts, and timestamps.

Provider compatibility

Anything that implements POST /chat/completions with the OpenAI request / response shape works:

Xiaomi MiMo (https://api.xiaomimimo.com/v1) — including mimo-v2.5-*.
OpenAI, Azure OpenAI, DeepSeek, Together, Groq, OpenRouter.
Self-hosted: vLLM, llama.cpp server, Ollama (/v1), TGI's OpenAI-compatible endpoint.

If a provider needs custom headers (e.g. OpenAI-Project), put them under extra_headers in the suite.

Development

. .venv/bin/activate
pytest -q          # unit + offline integration tests (no network)
ruff check .       # lint
mypy src           # type-check

CI runs the same three commands on every push (see .github/workflows/ci.yml).

Roadmap

HTML report with sortable tables.
Streaming progress over WebSocket for long suites.
Built-in "drift" mode: re-run yesterday's suite and diff against the last JSON trace.

License

MIT — see LICENSE.

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
.github/workflows		.github/workflows
docs		docs
src/llm_eval		src/llm_eval
suites		suites
tests		tests
.env.example		.env.example
.gitignore		.gitignore
CHANGELOG.md		CHANGELOG.md
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
README.md		README.md
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

llm-eval-cli

Why

Install

Quick start

Suite format

Graders

Reports

Provider compatibility

Development

Roadmap

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

llm-eval-cli

Why

Install

Quick start

Suite format

Graders

Reports

Provider compatibility

Development

Roadmap

License

About

Topics

Resources

License

Contributing

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages