EvalHub

AI evaluation,
done right.

EvalHub orchestrates reproducible LLM evaluations on Kubernetes — broad benchmark coverage, curated domain collections, and native CI/CD support.

# Install the CLI
pip install "eval-hub-sdk[cli]"
# Connect to your cluster
evalhub config set base_url https://evalhub.apps.cluster.example.com
evalhub config set token "$(kubectl create token evalhub-sa -n evalhub)"
# Run an evaluation
evalhub eval run \
  --model-url https://llama3.example.com/v1 \
  --model-name meta-llama/Llama-3.2-8B-Instruct \
  --benchmark mmlu --wait
# Job submitted: eval-a1b2c3d4
# ✓ completed mmlu acc 0.723
# ✓ completed hellaswag acc_norm 0.801

What’s inside

Multi-framework benchmarks

Evaluate against MMLU, HellaSwag, GSM8K, and more via lm-evaluation-harness, Garak, GuideLLM, and LightEval — all from a single CLI command.

Domain collections

Curated benchmark sets for healthcare, finance, and automotive safety with weighted scoring and configurable pass thresholds.
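
To make "weighted scoring" and "pass thresholds" concrete, here is a minimal Python sketch of one common way such a score can be computed: a weight-normalised mean compared against a threshold. This is an illustration only, not EvalHub's actual implementation; the `BenchmarkResult` class, the function names, and the benchmark names, weights, and threshold in the example are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkResult:
    """One benchmark outcome inside a collection (hypothetical structure)."""
    name: str
    score: float   # metric value, assumed to lie in [0, 1]
    weight: float  # relative importance within the collection


def weighted_score(results):
    """Return the weight-normalised mean of the benchmark scores."""
    total_weight = sum(r.weight for r in results)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(r.score * r.weight for r in results) / total_weight


def passes(results, threshold):
    """True when the collection's weighted score meets the pass threshold."""
    return weighted_score(results) >= threshold


# Hypothetical collection: the first benchmark counts twice as much.
example = [
    BenchmarkResult("clinical-qa", score=0.8, weight=2.0),
    BenchmarkResult("dosage-math", score=0.5, weight=1.0),
]
print(round(weighted_score(example), 3))
print(passes(example, threshold=0.65))
```

Normalising by the total weight keeps the score on the same scale as the individual metrics, so a threshold can be read directly as a minimum average quality.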

Kubernetes-native

Each evaluation runs as an isolated Kubernetes Job. Multi-tenancy, OCI artifact storage, and MLflow experiment tracking are built in.

CI/CD ready

Block on results with --wait, export JSON for downstream tooling, and automate with the Python SDK or REST API.
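
As a sketch of what "downstream tooling" could look like, the Python snippet below reads an exported JSON report and lists every metric that falls below a required minimum, which a pipeline step could use to fail a build. The report layout (a `results` list with `benchmark`, `metric`, and `value` keys) is an assumption made for illustration, not EvalHub's documented export schema, and the function names are hypothetical. It uses only the standard library and does not call the EvalHub SDK.

```python
import json


def load_report(path):
    """Load an exported results file (layout assumed, see lead-in)."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def failing_metrics(report, minimums):
    """Return (benchmark, metric, value, minimum) for each unmet minimum.

    `minimums` maps (benchmark, metric) pairs to the lowest acceptable value.
    A metric that is absent from the report is treated as a failure.
    """
    scores = {
        (row["benchmark"], row["metric"]): row["value"]
        for row in report["results"]
    }
    failures = []
    for key, minimum in minimums.items():
        value = scores.get(key)
        if value is None or value < minimum:
            failures.append((*key, value, minimum))
    return failures


# In-memory example using the hypothetical layout.
report = {"results": [{"benchmark": "mmlu", "metric": "acc", "value": 0.723}]}
print(failing_metrics(report, {("mmlu", "acc"): 0.70}))
```

In a CI job, the script would typically exit with a non-zero status when the returned list is not empty, so the pipeline stops before a regressed model is promoted.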


Explore AI Evaluations with EvalHub