Promptfoo — CLI for prompt testing, red-teaming, and model comparison
DeepEval — Pytest-style framework with 50+ LLM metrics for RAG
Ragas — RAG evaluation with auto test dataset generation
OpenAI Evals — Benchmark registry with automated and human grading
Comet Opik — Tracing, evaluation, and monitoring for LLM agents
claude-vm - Run Claude within a VM
nono - Secure, kernel-enforced sandbox CLI and SDKs for AI agents
Docker - Container platform for building and running applications in isolated environments
RTK - CLI proxy that reduces LLM token consumption