EvalHub orchestrates reproducible LLM evaluations on Kubernetes, with broad benchmark coverage, curated domain collections, and native CI/CD support.
# Install the CLI
pip install "eval-hub-sdk[cli]"

# Connect to your cluster
evalhub config set base_url https://evalhub.apps.cluster.example.com
evalhub config set token $(kubectl create token evalhub-sa -n evalhub)

# Run an evaluation
evalhub eval run \
  --model-url https://llama3.example.com/v1 \
  --model-name meta-llama/Llama-3.2-8B-Instruct \
  --benchmark mmlu --wait

# Job submitted: eval-a1b2c3d4
# ✓ completed mmlu acc 0.723
# ✓ completed hellaswag acc_norm 0.801

What’s inside
Multi-framework benchmarks
Evaluate against MMLU, HellaSwag, GSM8K, and more via lm-evaluation-harness, Garak, GuideLLM, and LightEval — all from a single CLI command.
Domain collections
Curated benchmark sets for healthcare, finance, and automotive safety with weighted scoring and configurable pass thresholds.
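To make "weighted scoring" and "pass thresholds" concrete, here is a minimal sketch of how a collection score could be computed as a weighted mean and compared against a threshold. This illustrates the general idea only and is not EvalHub's actual implementation; the `BenchmarkResult` type, the function names, the weights, and the threshold values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkResult:
    """Hypothetical record for one benchmark in a collection."""
    name: str
    score: float   # metric value, assumed to lie in [0, 1]
    weight: float  # relative importance within the collection (assumed)


def collection_score(results):
    """Weighted mean of the benchmark scores."""
    total_weight = sum(r.weight for r in results)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(r.score * r.weight for r in results) / total_weight


def passes(results, threshold):
    """True when the weighted score meets or exceeds the threshold."""
    return collection_score(results) >= threshold


# Scores taken from the sample output above; the weights are made up.
results = [
    BenchmarkResult("mmlu", 0.723, 2.0),
    BenchmarkResult("hellaswag", 0.801, 1.0),
]

score = collection_score(results)
print(f"{score:.3f}", passes(results, threshold=0.70))  # 0.749 True
```

A weighted mean keeps the collection score on the same 0 to 1 scale as the individual metrics, so a single threshold stays easy to interpret.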
Kubernetes-native
Each evaluation runs as an isolated Kubernetes Job. Multi-tenancy, OCI artifact storage, and MLflow experiment tracking are built in.
CI/CD ready
Block on results with --wait, export JSON for downstream tooling, and automate with the Python SDK or REST API.
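As an illustration of the downstream-tooling step, the sketch below shows how a CI job might gate on exported results. The JSON layout (`job_id`, `results`, `benchmark`, `metric`, `value`) is an assumed shape for illustration, not EvalHub's documented export schema, and `failing_benchmarks` and the threshold values are hypothetical.

```python
import json


def failing_benchmarks(report, thresholds):
    """Return the names of benchmarks that are missing or below their minimum."""
    scores = {r["benchmark"]: r["value"] for r in report["results"]}
    return sorted(
        name
        for name, minimum in thresholds.items()
        if name not in scores or scores[name] < minimum
    )


# Assumed export format; a real pipeline would read this from a file instead.
exported = json.loads("""
{
  "job_id": "eval-a1b2c3d4",
  "results": [
    {"benchmark": "mmlu", "metric": "acc", "value": 0.723},
    {"benchmark": "hellaswag", "metric": "acc_norm", "value": 0.801}
  ]
}
""")

# Hypothetical per-benchmark minimums chosen by the pipeline owner.
failed = failing_benchmarks(exported, {"mmlu": 0.70, "hellaswag": 0.85})
print(failed)  # ['hellaswag']

# A CI script would pass this to sys.exit() so the stage fails on regressions.
exit_code = 1 if failed else 0
```

Treating a missing benchmark as a failure is a deliberate choice in this sketch: a gate that silently passes when results are absent would hide broken evaluation runs.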