Tool Reference

The EvalHub MCP server exposes four tools: one for discovering evaluation providers and three for managing evaluation jobs.

Discover evaluation providers using agent metadata. Filter by target type and capability tags to find the right provider for a use case. Each result includes a summary, usage hints, result interpretation guidance, and complementary provider suggestions.

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| target_type | string | No | Filter by target type: model, agent, or inference_server |
| evaluates | string[] | No | Filter to providers whose agent.evaluates includes all listed tags |

When any filter is set, providers without an agent block are excluded.

Find model providers that evaluate safety:

{
  "evaluates": ["safety"],
  "target_type": "model"
}

Response:

{
  "providers": [
    {
      "id": "garak",
      "name": "garak",
      "title": "Garak",
      "summary": "Red-team an LLM for safety vulnerabilities, toxicity, and OWASP risks",
      "target_type": "model",
      "evaluates": ["safety", "security", "red_teaming", "toxicity"],
      "hints": [
        "The model endpoint must support OpenAI-compatible chat completions",
        "The 'quick' benchmark runs a single DAN probe for fast smoke testing (~2 min)"
      ],
      "result_interpretation": [
        "attack_success_rate measures how often the model was successfully exploited",
        "LOWER is better -- 0.0 means no attacks succeeded",
        "Scores above 0.3 indicate significant vulnerability"
      ],
      "complements": ["lm_evaluation_harness", "guidellm"],
      "recommended_when": [
        "User asks about model safety or toxicity",
        "Pre-deployment safety gate"
      ]
    }
  ]
}

The response includes a _meta object with diagnostic fields useful for debugging filter behavior:

| Field | Description |
| --- | --- |
| target_types_found | Comma-separated target types present in results |
| target_type | The target_type filter that was applied |
| evaluates_found | Comma-separated evaluates tags present in results |
| evaluates | The evaluates filter that was applied |

  • Without filters, all providers are returned (including those without agent metadata).
  • Prefer this tool over reading evalhub://providers when you need filtered, agent-oriented summaries.
  • recommended_when and complements are returned for display; they are not filterable parameters.
  • See Agent Discoverability for the full metadata model.
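
The filtering rules described above can be sketched in a few lines of Python. This is an illustration only, not the server's implementation. It assumes each provider record is a dict whose optional "agent" key holds "target_type" and "evaluates" (the source names agent.evaluates; placing target_type under agent is an assumption), and that an empty evaluates list counts as no filter.

```python
from typing import Any, Dict, List, Optional


def filter_providers(
    providers: List[Dict[str, Any]],
    target_type: Optional[str] = None,
    evaluates: Optional[List[str]] = None,
) -> List[Dict[str, Any]]:
    """Apply the documented filter rules to a list of provider records."""
    # Without filters, every provider is returned, including those
    # that carry no agent metadata.
    # Assumption: an empty evaluates list is treated as "no filter".
    if target_type is None and not evaluates:
        return list(providers)

    wanted = set(evaluates or [])
    matches = []
    for provider in providers:
        agent = provider.get("agent")
        # When any filter is set, providers without an agent block are excluded.
        if agent is None:
            continue
        if target_type is not None and agent.get("target_type") != target_type:
            continue
        # The provider must cover ALL requested tags, not just one of them.
        if not wanted.issubset(agent.get("evaluates", [])):
            continue
        matches.append(provider)
    return matches
```

Note that evaluates is an all-of match: asking for two tags returns only providers that list both.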

Submit a new model evaluation job.

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| name | string | Yes | Job name |
| description | string | No | Job description |
| tags | string[] | No | Tags for the job |
| model | object | Yes | Model configuration (see below) |
| benchmarks | object[] | No | List of benchmarks to run (mutually exclusive with collection) |
| collection | object | No | Pre-defined benchmark collection (mutually exclusive with benchmarks) |
| experiment | object | No | MLflow experiment configuration |

model object:

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| url | string | Yes | Model inference endpoint URL |
| name | string | Yes | Model display name |
| auth_secret | string | No | Kubernetes Secret reference for model endpoint authentication |

benchmarks array items:

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| id | string | Yes | Benchmark identifier |
| provider_id | string | Yes | Provider that runs this benchmark |

collection object:

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| id | string | Yes | Collection identifier |

experiment object:

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| name | string | No | MLflow experiment name |
| tags | object | No | Key-value tags for the experiment |
| artifact_location | string | No | MLflow artifacts storage path |

Example request:

{
  "name": "gpt-4o-leaderboard",
  "description": "Leaderboard evaluation of GPT-4o",
  "model": {
    "url": "https://api.openai.com/v1",
    "name": "gpt-4o"
  },
  "collection": {
    "id": "leaderboard-v2"
  },
  "experiment": {
    "name": "gpt-4o-may-2026"
  }
}

Example response:

{
  "job_id": "job-a1b2c3d4",
  "state": "pending"
}

  • You must provide either benchmarks or collection, not both.
  • If benchmarks is provided, it must not be empty.
  • Use get_job_status to monitor the submitted job.
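
The constraints above lend themselves to a client-side check before a job is submitted. The following is a minimal sketch under stated assumptions: validate_job_request is a hypothetical helper, not part of EvalHub, and the error messages are illustrative. It checks only the required fields and the benchmarks/collection rules listed in the tables and notes for this tool.

```python
from typing import Any, Dict, List


def validate_job_request(args: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in a job submission payload.

    Hypothetical helper: an empty list means the payload satisfies the
    documented rules; the server may enforce more than this.
    """
    errors: List[str] = []

    if not args.get("name"):
        errors.append("name is required")

    model = args.get("model")
    if not isinstance(model, dict):
        errors.append("model is required")
    else:
        for field in ("url", "name"):
            if not model.get(field):
                errors.append(f"model.{field} is required")

    # benchmarks and collection are mutually exclusive, and one is needed.
    has_benchmarks = "benchmarks" in args
    has_collection = "collection" in args
    if has_benchmarks and has_collection:
        errors.append("provide either benchmarks or collection, not both")
    elif not has_benchmarks and not has_collection:
        errors.append("provide one of benchmarks or collection")

    if has_benchmarks:
        benchmarks = args["benchmarks"]
        if not benchmarks:
            errors.append("benchmarks must not be empty")
        else:
            for i, item in enumerate(benchmarks):
                for field in ("id", "provider_id"):
                    if not item.get(field):
                        errors.append(f"benchmarks[{i}].{field} is required")

    if has_collection and not (args.get("collection") or {}).get("id"):
        errors.append("collection.id is required")

    return errors
```

Returning every problem at once, rather than raising on the first, lets an agent correct the whole payload in a single retry.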

Cancel a running or pending evaluation job.

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| job_id | string | Yes | The job identifier to cancel |

Example request:

{
  "job_id": "job-a1b2c3d4"
}

Example response:

{
  "job_id": "job-a1b2c3d4",
  "message": "Job job-a1b2c3d4 cancelled successfully"
}

  • Cancellation stops running benchmarks and marks them as cancelled.
  • Use get_job_status to verify the final state after cancellation.
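
The cancel-then-verify sequence might look like the sketch below. It assumes a hypothetical call_tool(name, arguments) callable that invokes an MCP tool and returns its parsed JSON result as a dict; substitute whatever your MCP client provides. The cancel tool's name is passed in by the caller rather than hard-coded, since it is not stated in this section.

```python
from typing import Any, Callable, Dict

# Hypothetical signature for an MCP tool invoker: (tool_name, arguments) -> result.
ToolCaller = Callable[[str, Dict[str, Any]], Dict[str, Any]]


def cancel_and_verify(call_tool: ToolCaller, cancel_tool: str, job_id: str) -> str:
    """Cancel a job, then read back its state with get_job_status.

    Returns the state reported after cancellation so the caller can see
    whether the job ended up cancelled or had already finished.
    """
    call_tool(cancel_tool, {"job_id": job_id})
    status = call_tool("get_job_status", {"job_id": job_id})
    return status["state"]
```

Returning the state instead of a boolean keeps the case of a job that finished before the cancellation took effect visible to the caller.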

Get the current status of an evaluation job with progress and per-benchmark details.

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| job_id | string | Yes | The job identifier to check |

Example request:

{
  "job_id": "job-a1b2c3d4"
}

Example response:

{
  "job_id": "job-a1b2c3d4",
  "state": "running",
  "progress_percent": 50,
  "benchmarks": [
    {
      "id": "mmlu",
      "provider_id": "lm-evaluation-harness",
      "status": "completed",
      "started_at": "2026-05-21T10:00:00Z",
      "completed_at": "2026-05-21T10:15:00Z",
      "result_interpretation": "Higher is better. Measures broad academic knowledge across 57 subjects.",
      "complements": ["hellaswag", "arc_challenge"]
    },
    {
      "id": "hellaswag",
      "provider_id": "lm-evaluation-harness",
      "status": "running",
      "started_at": "2026-05-21T10:15:00Z"
    }
  ],
  "created_at": "2026-05-21T09:59:00Z",
  "started_at": "2026-05-21T10:00:00Z"
}

| State | Description |
| --- | --- |
| pending | Job is queued and waiting to start |
| running | One or more benchmarks are executing |
| completed | All benchmarks finished successfully |
| failed | One or more benchmarks failed |
| cancelled | Job was cancelled by the user |
| partially_failed | Some benchmarks completed, others failed |

  • This tool is designed for polling: call it repeatedly to monitor a running evaluation.
  • progress_percent ranges from 0 to 100.
  • Each benchmark entry includes individual timestamps when available.
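
A polling loop built on get_job_status could be sketched as follows. As a hedge: call_tool is a hypothetical callable that invokes an MCP tool and returns its parsed result, the interval and poll limit are arbitrary illustrative defaults, and treating every state other than pending and running as final is an assumption drawn from the state table rather than a documented guarantee.

```python
import time
from typing import Any, Callable, Dict

# Hypothetical signature for an MCP tool invoker: (tool_name, arguments) -> result.
ToolCaller = Callable[[str, Dict[str, Any]], Dict[str, Any]]

# Assumption: any state outside this set is final.
ACTIVE_STATES = {"pending", "running"}


def wait_for_job(
    call_tool: ToolCaller,
    job_id: str,
    interval_s: float = 5.0,
    max_polls: int = 120,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Poll get_job_status until the job leaves the active states.

    Returns the last status payload, or raises TimeoutError once
    max_polls calls have been made without reaching a final state.
    """
    for _ in range(max_polls):
        status = call_tool("get_job_status", {"job_id": job_id})
        if status["state"] not in ACTIVE_STATES:
            return status
        sleep(interval_s)
    raise TimeoutError(f"job {job_id} still active after {max_polls} polls")
```

The sleep function is injectable so the loop can be exercised in tests without real delays; the returned payload still carries progress_percent and the per-benchmark entries for reporting.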