llm-eval-cli

Reproducible eval suites for any OpenAI-compatible LLM endpoint — Markdown research reports + JSON traces, in one command.

llm-eval-cli is a small, opinionated CLI that lets you describe an evaluation suite in YAML (cases × models × graders), run it against any OpenAI-compatible endpoint (OpenAI, Xiaomi MiMo, DeepSeek, Together, vLLM, llama.cpp server, Ollama, …), and get back:

  • a Markdown research report (reports/<name>.md) with a leaderboard, per-case breakdown, and failure appendix, and
  • a JSON trace (reports/<name>.json) with every prompt, response, latency, token count, and grader verdict so the run is fully reproducible.

It is built for comparing model A against model B on your own tasks before shipping, not for replacing large benchmark harnesses.

Why

Most eval tooling is either heavy (lm-eval-harness, helm) or coupled to a single provider. When you're choosing between, say, MiMo, GPT-4o-mini, and DeepSeek for a real product feature, you usually want:

  1. A handful of your prompts, not MMLU.
  2. Mixed grading: exact match for arithmetic, JSON-Schema for structured output, regex for format, and an LLM-as-judge for free-form answers.
  3. A diff-able artifact you can paste into a PR or share with a teammate.

This tool gives you exactly that, in ~1k lines of Python.

Install

git clone https://github.com/zakahadi/llm-eval-cli.git
cd llm-eval-cli
python3 -m venv .venv && . .venv/bin/activate
pip install -e ".[dev]"

Requires Python ≥ 3.10.

Quick start

  1. Set the API keys for whichever providers you want to hit:

    export MIMO_API_KEY=sk-...
    export OPENAI_API_KEY=sk-...

  2. Validate a suite:

    llm-eval validate suites/mimo_smoke.yaml

  3. Run it (uses mimo as the LLM judge):

    llm-eval run suites/mimo_smoke.yaml --judge mimo

  4. Open reports/mimo_smoke-<timestamp>.md.

Suite format

A suite is a single YAML file. Minimal example:

name: my_suite
providers:
  - name: mimo
    base_url: https://api.xiaomimimo.com/v1
    api_key_env: MIMO_API_KEY
models:
  - mimo-v2.5-flash
options:
  temperature: 0.0
  concurrency: 4
  repeats: 1
cases:
  - id: arithmetic
    prompt: What is 23 * 17? Reply with the integer only.
    graders:
      - type: regex
        pattern: '^\s*391\s*$'

See suites/mimo_smoke.yaml for a full example with JSON-Schema validation, instruction-following checks, and an LLM-as-judge case.
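
To make the cases × models structure concrete, here is a minimal sketch of how a parsed suite could be expanded into individual jobs. It assumes the YAML has already been loaded into a plain dict (for example with PyYAML's `yaml.safe_load`). The names `Job` and `expand_jobs`, and the set of required keys, are hypothetical and chosen for illustration; they are not taken from the tool's source.

```python
from itertools import product
from typing import NamedTuple


class Job(NamedTuple):
    """One unit of work: a single prompt sent to a single model (hypothetical type)."""
    model: str
    case_id: str
    prompt: str
    repeat: int


def expand_jobs(suite: dict) -> list[Job]:
    """Expand a parsed suite dict into one Job per (model, case, repeat)."""
    # Assumption: these four top-level keys are mandatory.
    for key in ("name", "providers", "models", "cases"):
        if key not in suite:
            raise ValueError(f"suite is missing required key: {key}")
    repeats = suite.get("options", {}).get("repeats", 1)
    return [
        Job(model, case["id"], case["prompt"], r)
        for model, case, r in product(suite["models"], suite["cases"], range(repeats))
    ]


# The minimal example above as a dict, with repeats raised to 2 for illustration.
suite = {
    "name": "my_suite",
    "providers": [{"name": "mimo",
                   "base_url": "https://api.xiaomimimo.com/v1",
                   "api_key_env": "MIMO_API_KEY"}],
    "models": ["mimo-v2.5-flash"],
    "options": {"temperature": 0.0, "concurrency": 4, "repeats": 2},
    "cases": [{"id": "arithmetic",
               "prompt": "What is 23 * 17? Reply with the integer only.",
               "graders": [{"type": "regex", "pattern": r"^\s*391\s*$"}]}],
}

jobs = expand_jobs(suite)
print(len(jobs))  # 1 model x 1 case x 2 repeats
```

The total number of requests grows as models × cases × repeats, which is worth keeping in mind when setting concurrency against a rate-limited provider.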

Graders

Type         What it checks
exact        Stripped, optionally case-insensitive equality.
contains     Substring match.
regex        Python regex search.
json_schema  Response parses as JSON and matches a JSON Schema (Draft 2020-12).
llm_judge    A separate model scores against a rubric and returns JSON.

Each grader returns score ∈ [0, 1] and a reason. The case-level score is a weighted mean across graders; the case "passes" only if every grader passes.
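
The scoring rule above can be sketched in a few lines. This is an illustration under stated assumptions, not the tool's actual code: the function names are hypothetical, each grader is modelled as returning a `(score, reason)` tuple, and a grader is assumed to "pass" when its score reaches a threshold of 1.0.

```python
import re


def grade_exact(response: str, expected: str, ignore_case: bool = False) -> tuple[float, str]:
    """Stripped, optionally case-insensitive equality."""
    got, want = response.strip(), expected.strip()
    if ignore_case:
        got, want = got.lower(), want.lower()
    ok = got == want
    return (1.0 if ok else 0.0, "match" if ok else f"expected {expected!r}")


def grade_regex(response: str, pattern: str) -> tuple[float, str]:
    """Python regex search anywhere in the response."""
    ok = re.search(pattern, response) is not None
    return (1.0 if ok else 0.0, "pattern found" if ok else f"no match for {pattern!r}")


def aggregate(results: list[tuple[float, str]], weights: list[float],
              threshold: float = 1.0) -> tuple[float, bool]:
    """Weighted mean of grader scores; the case passes only if every grader passes."""
    score = sum(s * w for (s, _), w in zip(results, weights)) / sum(weights)
    # Assumption: a grader passes when its score is at least `threshold`.
    passed = all(s >= threshold for s, _ in results)
    return score, passed


results = [
    grade_regex(" 391\n", r"^\s*391\s*$"),  # matches: score 1.0
    grade_exact("391.", "391"),             # trailing dot: score 0.0
]
score, passed = aggregate(results, weights=[3.0, 1.0])
print(score, passed)  # 0.75 False
```

Note that the two outputs are independent: a case can have a high weighted score and still fail, because a single failing grader is enough to fail the case.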

Reports

The Markdown report contains:

  • a leaderboard sorted by pass-rate, average score, and p50 latency,
  • a per-case table with verdict, score, latency, and the first failure reason, and
  • a failures appendix with the actual model output for the first 10 failures (useful for debugging prompts).

The JSON report has the full trace: prompt, system message, response, per-grader score and reason, latencies, token counts, and timestamps.
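
As a rough illustration of how the leaderboard could be derived from per-run trace records, the sketch below groups records by model and sorts them. The record field names (`model`, `passed`, `score`, `latency_ms`) are hypothetical, since the trace schema is not spelled out here, and the sort direction (higher pass rate first, then higher average score, then lower p50 latency) is an assumed reading of the ordering described above.

```python
from statistics import mean, median


def leaderboard(records: list[dict]) -> list[dict]:
    """Aggregate per-run records into one row per model, best model first."""
    by_model: dict[str, list[dict]] = {}
    for rec in records:
        by_model.setdefault(rec["model"], []).append(rec)

    rows = []
    for model, recs in by_model.items():
        rows.append({
            "model": model,
            "pass_rate": mean(1.0 if r["passed"] else 0.0 for r in recs),
            "avg_score": mean(r["score"] for r in recs),
            "p50_latency_ms": median(r["latency_ms"] for r in recs),
        })

    # Assumed ordering: pass rate and score descending, latency ascending.
    rows.sort(key=lambda r: (-r["pass_rate"], -r["avg_score"], r["p50_latency_ms"]))
    return rows


# Made-up records for two placeholder models, purely to exercise the function.
records = [
    {"model": "model-a", "passed": True,  "score": 1.0, "latency_ms": 420},
    {"model": "model-a", "passed": False, "score": 0.5, "latency_ms": 380},
    {"model": "model-b", "passed": True,  "score": 1.0, "latency_ms": 900},
    {"model": "model-b", "passed": True,  "score": 0.9, "latency_ms": 700},
]
rows = leaderboard(records)
for row in rows:
    print(row["model"], row["pass_rate"], row["p50_latency_ms"])
```

Here model-b ranks first despite being slower, because latency is only used to break ties on pass rate and average score.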

Provider compatibility

Anything that implements POST /chat/completions with the OpenAI request / response shape works:

  • Xiaomi MiMo (https://api.xiaomimimo.com/v1) — including mimo-v2.5-*.
  • OpenAI, Azure OpenAI, DeepSeek, Together, Groq, OpenRouter.
  • Self-hosted: vLLM, llama.cpp server, Ollama (/v1), TGI's OpenAI-compatible endpoint.

If a provider needs custom headers (e.g. OpenAI-Project), put them under extra_headers in the suite.

Development

. .venv/bin/activate
pytest -q          # unit + offline integration tests (no network)
ruff check .       # lint
mypy src           # type-check

CI runs the same three commands on every push (see .github/workflows/ci.yml).

Roadmap

  • HTML report with sortable tables.
  • Streaming progress over WebSocket for long suites.
  • Built-in "drift" mode: re-run yesterday's suite and diff against the last JSON trace.

License

MIT — see LICENSE.
