# AgentLab: Local-First Test Harness for AI Agents

Run, test, and debug your agents anywhere — no cloud, no lock-in, no noise.

Most “agent sandboxes” are cloud-heavy and tied to a single framework. AgentLab is lightweight, offline, and framework-agnostic — giving you a reproducible developer experience for building and testing AI agents from your terminal.
## Perfect for

- Backend & AI engineers who want fast iteration without cloud friction
- Teams integrating agents into existing systems
- Reproducible, CI-friendly agent tests
## Features

- Local-First — Works with Ollama (default model: `qwen3:8b`) or any HTTP LLM endpoint
- Framework-Agnostic — Blueprints are plain YAML; adapters can target LangChain/LlamaIndex later
- Reproducible Tests — Define evaluation cases alongside the blueprint
- Mockable Tools — Run deterministic CI without hitting real APIs
- Developer-First UX — CLI-first, Git-friendly configs
## Quick Start

```bash
# 0) Ensure Ollama has the model
ollama pull qwen3:8b

# 1) (Optional) Create a virtual environment
python -m venv .venv && source .venv/bin/activate

# 2) Install
pip install -e .

# 3) Run an example agent (summarizer)
agentlab run blueprints/summarizer.yaml \
  -i "User failed login due to rate limits." \
  --model qwen3:8b

# 4) Run its evaluations
agentlab eval blueprints/summarizer.yaml
```

## Create Your Own Agent

```bash
# Create a new blueprint (and an optional test)
agentlab init my-agent --out blueprints --tests

# Run it
agentlab run blueprints/my-agent.yaml -i "Sample input" --model qwen3:8b

# Run its evaluation cases
agentlab eval blueprints/my-agent.yaml --model qwen3:8b
```

## CLI Flags

- `--strip-think` (run): remove `<think>…</think>` tags from the final output.
- `--no-strip-think` (eval): eval strips by default; use this to disable.
- `--junit <path>` (eval): write a JUnit XML report.
- `--temperature`, `--top-p`: generation controls (defaults 0 and 1 for determinism).
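These flags compose; for example, a deterministic CI run that also writes a JUnit report might look like this (the report path is illustrative):

```bash
agentlab eval blueprints/my-agent.yaml \
  --model qwen3:8b \
  --temperature 0 --top-p 1 \
  --junit reports/junit.xml
```

Most CI systems (GitHub Actions, GitLab CI, Jenkins) can pick up the JUnit XML file directly as a test report.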
## Another Example: Incident Triage

Run a second example focused on operations-style summaries.

```bash
agentlab run blueprints/incident-triage.yaml \
  -i "Database connection timeout after 5 retries." \
  --model qwen3:8b
```

The blueprint behind it:

```yaml
name: incident-triage
description: Triage incident statements into a concise, actionable summary.
system_prompt: |
  You are an incident triage assistant. Produce a single sentence that states the core issue
  and the likely action. Avoid hedging. If a cause is clear (e.g., rate limit, timeout),
  include it directly.
tools: []
memory:
  strategy: short_term
plan:
  - step: generate
    name: final
evaluation:
  - input: "Database connection timeout after 5 retries."
    expected: "timeout"
  - input: "User login failed due to exceeding rate limits."
    expected: "rate limit"
```

## Concepts

- Blueprint: YAML spec for agent purpose, tools, memory, plan, and eval cases
- Plan: ordered steps (currently `tool_use` | `note` | `generate`)
- Tools: mocked for local dev; real tool adapters can be added later
- LLM: local via Ollama (`qwen3:8b` by default)
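To illustrate the three step types together, a multi-step plan might look like the sketch below. The tool name and the extra fields (`text`, `tool`) are hypothetical, inferred only from the step types listed above:

```yaml
plan:
  - step: note
    name: context          # hypothetical: record guidance for later steps
    text: "Prefer the tool result over the raw input."
  - step: tool_use
    name: lookup           # hypothetical call to a mocked tool
    tool: ticket_search
  - step: generate
    name: final            # final LLM generation, as in the examples above
```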
## Roadmap

- Phase 1 (MVP): CLI, YAML config, Ollama, mocks, basic evals ✅
- Phase 2: `agentlab init`, streaming CLI, prompt templating, richer evaluators
- Phase 3: TUI mode, adapters (LangChain/LlamaIndex), exportable reports
## Plugins

See docs/plugins.md for how to create and publish external tools via entry points.
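As a sketch, publishing a tool via Python entry points usually amounts to a fragment like the one below in the plugin package's `pyproject.toml`. The entry-point group name and module path here are assumptions; docs/plugins.md is the authoritative reference.

```toml
# pyproject.toml of an external tool package
# (the group name "agentlab.tools" is hypothetical)
[project.entry-points."agentlab.tools"]
ticket_search = "my_agentlab_tools.tickets:TicketSearchTool"
```

Once the package is installed, the host application can discover such tools at runtime via `importlib.metadata.entry_points()`.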
## OpenAPI Tool Ingestion

Ingest an OpenAPI spec and register its operations as tools at runtime (supports a URL or a local file):

```bash
# Live Swagger Petstore demo
agentlab run blueprints/petstore.yaml -i "" \
  --openapi-spec https://petstore3.swagger.io/api/v3/openapi.json \
  --openapi-tag pet \
  --openapi-base-url https://petstore3.swagger.io/api/v3 \
  --model qwen3:8b
```

## License

MIT