EvalHub

AI evaluation, done right.

EvalHub orchestrates reproducible LLM evaluations on Kubernetes — broad benchmark coverage, curated domain collections, and native CI/CD support.

Getting Started GitHub

# Install the CLI
pip install "eval-hub-sdk[cli]"

# Connect to your cluster
evalhub config set base_url https://evalhub.apps.cluster.example.com
evalhub config set token "$(kubectl create token evalhub-sa -n evalhub)"

# Run an evaluation
evalhub eval run \
  --model-url  https://llama3.example.com/v1 \
  --model-name meta-llama/Llama-3.1-8B-Instruct \
  --benchmark  mmlu --wait

# Job submitted: eval-a1b2c3d4
# ✓ completed  mmlu       acc       0.723

What’s inside

Multi-framework benchmarks

Evaluate against MMLU, HellaSwag, GSM8K, and more via lm-evaluation-harness, Garak, GuideLLM, and LightEval — all from a single CLI command.

Domain collections

Curated benchmark sets for healthcare, finance, and automotive safety with weighted scoring and configurable pass thresholds.
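
As a rough illustration of what weighted scoring with a pass threshold can look like, the sketch below computes a weight-normalised mean over a collection and compares it to a threshold. It is not EvalHub's implementation: the `BenchmarkResult` type, the function names, and the benchmark names, scores, and weights are all hypothetical, and the real aggregation rule may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkResult:
    name: str
    score: float   # metric value, assumed to lie in [0, 1]
    weight: float  # relative importance within the collection


def weighted_score(results: list[BenchmarkResult]) -> float:
    """Return the weight-normalised mean of the benchmark scores."""
    total_weight = sum(r.weight for r in results)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(r.score * r.weight for r in results) / total_weight


def passes(results: list[BenchmarkResult], threshold: float) -> bool:
    """True when the collection's weighted score meets the threshold."""
    return weighted_score(results) >= threshold


# Hypothetical collection: names, scores, and weights are illustrative only.
collection = [
    BenchmarkResult("benchmark_a", score=0.80, weight=2.0),
    BenchmarkResult("benchmark_b", score=0.60, weight=1.0),
    BenchmarkResult("benchmark_c", score=0.50, weight=1.0),
]

print(f"weighted score: {weighted_score(collection):.3f}")
print("passes at 0.65:", passes(collection, threshold=0.65))
```

Normalising by the total weight keeps the combined score on the same scale as the individual metrics, so a threshold such as 0.65 stays meaningful when weights change.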

Kubernetes-native

Each evaluation runs as an isolated Kubernetes Job. Multi-tenancy, OCI artifact storage, and MLflow experiment tracking are built in.
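
To make "one evaluation, one Job" concrete, here is a minimal sketch that builds a standard `batch/v1` Job manifest as a Python dictionary. It only shows the general shape of such a Job. The `evaluation_job` helper, the container name, the image, and the container arguments are assumptions for illustration, not the manifest EvalHub actually generates.

```python
import json


def evaluation_job(job_id: str, namespace: str, image: str, args: list[str]) -> dict:
    """Build a batch/v1 Job manifest that runs one evaluation to completion."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": job_id, "namespace": namespace},
        "spec": {
            "backoffLimit": 0,  # do not retry a failed evaluation automatically
            "template": {
                "spec": {
                    "restartPolicy": "Never",  # Jobs accept only Never or OnFailure
                    "containers": [
                        {"name": "evaluator", "image": image, "args": args}
                    ],
                }
            },
        },
    }


manifest = evaluation_job(
    job_id="eval-a1b2c3d4",
    namespace="evalhub",
    image="registry.example.com/evaluator:latest",  # hypothetical image
    args=["--benchmark", "mmlu"],                   # hypothetical container arguments
)
print(json.dumps(manifest, indent=2))
```

Because each run gets its own Job and Pod, one evaluation's dependencies and failures stay separate from every other run in the namespace.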

CI/CD ready

Block on results with --wait, export JSON for downstream tooling, and automate with the Python SDK or REST API.
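
A pipeline step that gates on exported results might look like the sketch below. It uses only the Python standard library and deliberately avoids the SDK. The JSON layout (a `results` list with `benchmark`, `metric`, and `value` fields) and the `failed_checks` helper are assumptions, so adapt the field names to whatever the real export contains.

```python
import json


def failed_checks(results_json: str, thresholds: dict[str, float]) -> list[str]:
    """Return one message per benchmark that is missing or below its threshold."""
    # Assumed shape: {"results": [{"benchmark": str, "metric": str, "value": float}]}
    payload = json.loads(results_json)
    values = {r["benchmark"]: r["value"] for r in payload["results"]}
    failures = []
    for benchmark, minimum in thresholds.items():
        value = values.get(benchmark)
        if value is None:
            failures.append(f"{benchmark}: no result found")
        elif value < minimum:
            failures.append(f"{benchmark}: {value:.3f} < {minimum:.3f}")
    return failures


# Sample export in the assumed shape, reusing the mmlu score shown earlier.
sample = json.dumps(
    {"results": [{"benchmark": "mmlu", "metric": "acc", "value": 0.723}]}
)

failures = failed_checks(sample, thresholds={"mmlu": 0.70})
print("gate passed" if not failures else "\n".join(failures))
```

In a CI job, the script would typically end with `sys.exit(1)` when `failures` is non-empty, so the pipeline stops on a regression.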

Explore AI Evaluations with EvalHub