Tool Reference
The EvalHub MCP server exposes four tools: one for discovering evaluation providers and three for managing evaluation jobs.
Tool: discover_providers
Section titled “Tool: discover_providers”Discover evaluation providers using agent metadata. Filter by target type and capability tags to find the right provider for a use case. Each result includes a summary, usage hints, result interpretation guidance, and complementary provider suggestions.
Parameters
Section titled “Parameters”| Parameter | Type | Required | Description |
|---|---|---|---|
target_type | string | No | Filter by target type: model, agent, or inference_server |
evaluates | string[] | No | Filter to providers whose agent.evaluates includes all listed tags |
When any filter is set, providers without an agent block are excluded.
Example request
Section titled “Example request”Find model providers that evaluate safety:
{ "evaluates": ["safety"], "target_type": "model"}Example response
Section titled “Example response”{ "providers": [ { "id": "garak", "name": "garak", "title": "Garak", "summary": "Red-team an LLM for safety vulnerabilities, toxicity, and OWASP risks", "target_type": "model", "evaluates": ["safety", "security", "red_teaming", "toxicity"], "hints": [ "The model endpoint must support OpenAI-compatible chat completions", "The 'quick' benchmark runs a single DAN probe for fast smoke testing (~2 min)" ], "result_interpretation": [ "attack_success_rate measures how often the model was successfully exploited", "LOWER is better -- 0.0 means no attacks succeeded", "Scores above 0.3 indicate significant vulnerability" ], "complements": ["lm_evaluation_harness", "guidellm"], "recommended_when": [ "User asks about model safety or toxicity", "Pre-deployment safety gate" ] } ]}Response metadata
Section titled “Response metadata”The response includes a _meta object with diagnostic fields useful for debugging filter behavior:
| Field | Description |
|---|---|
target_types_found | Comma-separated target types present in results |
target_type | The target_type filter that was applied |
evaluates_found | Comma-separated evaluates tags present in results |
evaluates | The evaluates filter that was applied |
- Without filters, all providers are returned (including those without
agentmetadata). - Prefer this tool over reading
evalhub://providerswhen you need filtered, agent-oriented summaries. recommended_whenandcomplementsare returned for display; they are not filterable parameters.- See Agent Discoverability for the full metadata model.
Tool: submit_evaluation
Section titled “Tool: submit_evaluation”Submit a new model evaluation job.
Parameters
Section titled “Parameters”| Parameter | Type | Required | Description |
|---|---|---|---|
name | string | Yes | Job name |
description | string | No | Job description |
tags | string[] | No | Tags for the job |
model | object | Yes | Model configuration (see below) |
benchmarks | object[] | No | List of benchmarks to run (mutually exclusive with collection) |
collection | object | No | Pre-defined benchmark collection (mutually exclusive with benchmarks) |
experiment | object | No | MLflow experiment configuration |
model object:
| Field | Type | Required | Description |
|---|---|---|---|
url | string | Yes | Model inference endpoint URL |
name | string | Yes | Model display name |
auth_secret | string | No | Kubernetes Secret reference for model endpoint authentication |
benchmarks array items:
| Field | Type | Required | Description |
|---|---|---|---|
id | string | Yes | Benchmark identifier |
provider_id | string | Yes | Provider that runs this benchmark |
collection object:
| Field | Type | Required | Description |
|---|---|---|---|
id | string | Yes | Collection identifier |
experiment object:
| Field | Type | Required | Description |
|---|---|---|---|
name | string | No | MLflow experiment name |
tags | object | No | Key-value tags for the experiment |
artifact_location | string | No | MLflow artifacts storage path |
Example request
Section titled “Example request”{ "name": "gpt-4o-leaderboard", "description": "Leaderboard evaluation of GPT-4o", "model": { "url": "https://api.openai.com/v1", "name": "gpt-4o" }, "collection": { "id": "leaderboard-v2" }, "experiment": { "name": "gpt-4o-may-2026" }}Example response
Section titled “Example response”{ "job_id": "job-a1b2c3d4", "state": "pending"}- You must provide either
benchmarksorcollection, not both. - If
benchmarksis provided, it must not be empty. - Use
get_job_statusto monitor the submitted job.
Tool: cancel_job
Section titled “Tool: cancel_job”Cancel a running or pending evaluation job.
Parameters
Section titled “Parameters”| Parameter | Type | Required | Description |
|---|---|---|---|
job_id | string | Yes | The job identifier to cancel |
Example request
Section titled “Example request”{ "job_id": "job-a1b2c3d4"}Example response
Section titled “Example response”{ "job_id": "job-a1b2c3d4", "message": "Job job-a1b2c3d4 cancelled successfully"}- Cancellation stops running benchmarks and marks them as cancelled.
- Use
get_job_statusto verify the final state after cancellation.
Tool: get_job_status
Section titled “Tool: get_job_status”Get the current status of an evaluation job with progress and per-benchmark details.
Parameters
Section titled “Parameters”| Parameter | Type | Required | Description |
|---|---|---|---|
job_id | string | Yes | The job identifier to check |
Example request
Section titled “Example request”{ "job_id": "job-a1b2c3d4"}Example response
Section titled “Example response”{ "job_id": "job-a1b2c3d4", "state": "running", "progress_percent": 50, "benchmarks": [ { "id": "mmlu", "provider_id": "lm-evaluation-harness", "status": "completed", "started_at": "2026-05-21T10:00:00Z", "completed_at": "2026-05-21T10:15:00Z", "result_interpretation": "Higher is better. Measures broad academic knowledge across 57 subjects.", "complements": ["hellaswag", "arc_challenge"] }, { "id": "hellaswag", "provider_id": "lm-evaluation-harness", "status": "running", "started_at": "2026-05-21T10:15:00Z" } ], "created_at": "2026-05-21T09:59:00Z", "started_at": "2026-05-21T10:00:00Z"}Job states
Section titled “Job states”| State | Description |
|---|---|
pending | Job is queued and waiting to start |
running | One or more benchmarks are executing |
completed | All benchmarks finished successfully |
failed | One or more benchmarks failed |
cancelled | Job was cancelled by the user |
partially_failed | Some benchmarks completed, others failed |
- This tool is designed for polling — call it repeatedly to monitor a running evaluation.
progress_percentranges from 0 to 100.- Each benchmark entry includes individual timestamps when available.