Open source LLM evaluation tools

Promptfoo — CLI for prompt testing, red-teaming, and model comparison

DeepEval — Pytest-style framework with 50+ LLM metrics for RAG

Ragas — RAG evaluation with auto test dataset generation

OpenAI Evals — Benchmark registry with automated and human grading

Comet Opik — Tracing, evaluation, and monitoring for LLM agents

Open source sandboxing tools

claude-vm - Run Claude within a VM

nono - Secure, kernel-enforced sandbox CLI and SDKs for AI agents

Docker - Containerization tool

RTK - CLI proxy that reduces LLM token consumption
