Reproducible eval suites for any OpenAI-compatible LLM endpoint — Markdown research reports + JSON traces, in one command.
llm-eval-cli is a small, opinionated CLI that lets you describe an evaluation
suite in YAML (cases × models × graders), run it against any OpenAI-compatible
endpoint (OpenAI, Xiaomi MiMo, DeepSeek, Together, vLLM, llama.cpp server,
Ollama, …), and get back:
- a Markdown research report (
reports/<name>.md) with a leaderboard, per-case breakdown, and failure appendix, and - a JSON trace (
reports/<name>.json) with every prompt, response, latency, token count, and grader verdict so the run is fully reproducible.
It is built for the use case of "I want to compare model A vs model B on my tasks before shipping", not for replacing big benchmark harnesses.
Most eval tooling is either heavy (lm-eval-harness, helm) or coupled to a
single provider. When you're choosing between, say, MiMo, GPT-4o-mini, and
DeepSeek for a real product feature, you usually want:
- A handful of your prompts, not MMLU.
- Mixed grading: exact match for arithmetic, JSON-Schema for structured output, regex for format, and an LLM-as-judge for free-form answers.
- A diff-able artifact you can paste into a PR or share with a teammate.
This tool gives you exactly that, in ~1k lines of Python.
git clone https://github.com/zakahadi/llm-eval-cli.git
cd llm-eval-cli
python3 -m venv .venv && . .venv/bin/activate
pip install -e ".[dev]"Requires Python ≥ 3.10.
-
Set the API keys for whichever providers you want to hit:
export MIMO_API_KEY=sk-... export OPENAI_API_KEY=sk-...
-
Validate a suite:
llm-eval validate suites/mimo_smoke.yaml
-
Run it (uses
mimoas the LLM judge):llm-eval run suites/mimo_smoke.yaml --judge mimo
-
Open
reports/mimo_smoke-<timestamp>.md.
A suite is a single YAML file. Minimal example:
name: my_suite
providers:
- name: mimo
base_url: https://api.xiaomimimo.com/v1
api_key_env: MIMO_API_KEY
models:
- mimo-v2.5-flash
options:
temperature: 0.0
concurrency: 4
repeats: 1
cases:
- id: arithmetic
prompt: What is 23 * 17? Reply with the integer only.
graders:
- type: regex
pattern: '^\s*391\s*$'See suites/mimo_smoke.yaml for a full example with
JSON-Schema validation, instruction-following checks, and an LLM-as-judge case.
| Type | What it checks |
|---|---|
exact |
Stripped, optionally case-insensitive equality. |
contains |
Substring match. |
regex |
Python regex search. |
json_schema |
Response parses as JSON and matches a JSON-Schema (Draft 2020). |
llm_judge |
A separate model scores against a rubric and returns JSON. |
Each grader returns score ∈ [0, 1] and a reason. The case-level score is a
weighted mean across graders; the case "passes" only if every grader passes.
The Markdown report contains:
- a leaderboard sorted by pass-rate, average score, and p50 latency,
- a per-case table with verdict, score, latency, and the first failure reason, and
- a failures appendix with the actual model output for the first 10 failures (super useful for debugging prompts).
The JSON report has the full trace: prompt, system message, response, per-grader score and reason, latencies, token counts, and timestamps.
Anything that implements POST /chat/completions with the OpenAI request /
response shape works:
- Xiaomi MiMo (
https://api.xiaomimimo.com/v1) — includingmimo-v2.5-*. - OpenAI, Azure OpenAI, DeepSeek, Together, Groq, OpenRouter.
- Self-hosted: vLLM, llama.cpp server, Ollama (
/v1), TGI's OpenAI-compatible endpoint.
If a provider needs custom headers (e.g. OpenAI-Project), put them under
extra_headers in the suite.
. .venv/bin/activate
pytest -q # unit + offline integration tests (no network)
ruff check . # lint
mypy src # type-checkCI runs the same three commands on every push (see .github/workflows/ci.yml).
- HTML report with sortable tables.
- Streaming progress over WebSocket for long suites.
- Built-in "drift" mode: re-run yesterday's suite and diff against the last JSON trace.
MIT — see LICENSE.