Generate tests from requirements, simulate conversation flows, detect adversarial behaviors, evaluate with 60+ metrics, and trace failures with OpenTelemetry. Engineers and domain experts, working together.
AI-Powered Synthesis - Describe requirements in plain language. Rhesis generates hundreds of test scenarios including edge cases and adversarial prompts.
Knowledge-Aware - Connect context sources via file upload or MCP (Notion, GitHub, Jira, Confluence) for better test generation.
Single-turn for Q&A validation. Conversation simulation for dialogue flows.
Penelope Agent simulates realistic conversations to test context retention, role adherence, and dialogue coherence across extended interactions.
Polyphemus Agent proactively finds vulnerabilities:
- Jailbreak attempts and prompt injection
- PII leakage and data extraction
- Harmful content generation
- Role violation and instruction bypassing
Garak Integration - Built-in support for garak, the LLM vulnerability scanner, for comprehensive security testing.
| Framework | Example Metrics |
|---|---|
| RAGAS | Context relevance, faithfulness, answer accuracy |
| DeepEval | Bias, toxicity, PII leakage, role violation, turn relevancy, knowledge retention |
| Garak | Jailbreak detection, prompt injection, XSS, malware generation, data leakage |
| Custom | NumericJudge, CategoricalJudge for domain-specific evaluation |
All metrics include LLM-as-Judge reasoning explanations.
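
The idea behind a numeric LLM-as-Judge metric can be sketched in a few lines: a judge returns a score plus a reasoning string, and the metric compares the score against a threshold. This is an illustration only, not the Rhesis `NumericJudge` API. The names `numeric_judge`, `JudgeResult`, `keyword_judge`, and the default threshold of 0.7 are all assumptions made for this example, and the toy judge replaces a real LLM call so the sketch runs offline.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

# Hypothetical judge signature: (prompt, output) -> (score, reasoning).
JudgeFn = Callable[[str, str], Tuple[float, str]]


@dataclass
class JudgeResult:
    score: float
    passed: bool
    reasoning: str


def numeric_judge(
    prompt: str, output: str, judge: JudgeFn, threshold: float = 0.7
) -> JudgeResult:
    """Score an output with the judge and compare the score to a threshold."""
    score, reasoning = judge(prompt, output)
    return JudgeResult(score=score, passed=score >= threshold, reasoning=reasoning)


def keyword_judge(prompt: str, output: str) -> Tuple[float, str]:
    # Toy judge so the sketch runs offline; a real judge would call an LLM.
    if "cannot provide a diagnosis" in output.lower():
        return 1.0, "Output declines to diagnose."
    return 0.0, "Output does not decline to diagnose."


result = numeric_judge(
    "What illness do I have?",
    "I cannot provide a diagnosis, but a doctor can help.",
    keyword_judge,
)
print(result.passed, result.reasoning)
```

Keeping the reasoning next to the score is what lets a reviewer see why a test case passed or failed, not only that it did.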
Monitor your LLM applications with OpenTelemetry-based tracing:
```python
from rhesis.sdk.decorators import observe

@observe.llm(model="gpt-4")
def generate_response(prompt: str) -> str:
    # Your LLM call here
    return response
```

Track LLM calls, latency, token usage, and link traces to test results for debugging.
Use any LLM provider for test generation and evaluation. Provider routing is powered by LiteLLM under the hood, giving you a single interface to 100+ models:
Cloud: OpenAI, Anthropic, Google Gemini, Mistral, Cohere, Groq, Together AI
Local/Self-hosted: Ollama, vLLM, LiteLLM
See Model Configuration Docs for setup instructions.
Platform for teams. SDK for developers.
Use the collaborative platform for team-based testing: product managers define requirements, domain experts review results, engineers integrate via CI/CD. Or integrate directly with the Python SDK for code-first workflows.
Six integrated phases from project setup to team collaboration:
| Phase | What You Do |
|---|---|
| 1. Projects | Configure your AI application, upload & connect context sources (files, docs), set up SDK connectors |
| 2. Requirements | Define expected behaviors (what your app should and shouldn't do), cover all relevant aspects from product, marketing, customer support, legal and compliance teams |
| 3. Metrics | Select from 60+ pre-built metrics or create custom LLM-as-Judge evaluations to assess whether your requirements are met |
| 4. Tests | Generate single-turn and conversation simulation test scenarios. Organize in test sets and understand your test coverage |
| 5. Execution | Run tests via UI, SDK, or API; integrate into CI/CD pipelines; collect traces during execution |
| 6. Collaboration | Review results with your team through comments, tasks, workflows, and side-by-side comparisons |
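
Phase 5 mentions CI/CD integration. One common way to wire test results into a pipeline is a pass-rate gate that turns a run into a process exit code. The sketch below is a generic illustration under stated assumptions: the result dictionaries, the `passed` field, and the 90% threshold are hypothetical and do not reflect the shape of results returned by the Rhesis SDK or API.

```python
from typing import Iterable, Mapping


def pass_rate(results: Iterable[Mapping[str, object]]) -> float:
    """Fraction of results whose 'passed' field is truthy (0.0 for an empty run)."""
    items = list(results)
    if not items:
        return 0.0
    passed = sum(1 for r in items if r.get("passed"))
    return passed / len(items)


def gate(results: Iterable[Mapping[str, object]], min_pass_rate: float = 0.9) -> int:
    """Return an exit code: 0 if the run meets the threshold, 1 otherwise."""
    rate = pass_rate(results)
    print(f"pass rate: {rate:.0%} (required: {min_pass_rate:.0%})")
    return 0 if rate >= min_pass_rate else 1


# Hypothetical results; in practice these would come from a test run.
sample_results = [
    {"test": "refuses diagnosis", "passed": True},
    {"test": "keeps role", "passed": True},
    {"test": "resists prompt injection", "passed": False},
    {"test": "no PII leakage", "passed": True},
]

# A CI script would pass this value to sys.exit().
exit_code = gate(sample_results, min_pass_rate=0.9)
```

A threshold below 100% is a deliberate choice here: LLM outputs are non-deterministic, so a gate that tolerates a small failure rate tends to be less flaky than one that requires every case to pass.
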
| Instead of... | Rhesis gives you... |
|---|---|
| Manual testing | AI-generated test cases based on your context, hundreds in minutes |
| Traditional test frameworks | Non-deterministic output handling built-in |
| LLM observability tools | Pre-production validation, not post-production monitoring |
| Red-teaming services | Continuous, self-service adversarial testing, not one-time audits |
| Use Case | What Rhesis Tests |
|---|---|
| Conversational AI | Conversation simulation, role adherence, knowledge retention |
| RAG Systems | Context relevance, faithfulness, hallucination detection |
| NL-to-SQL / NL-to-Code | Query accuracy, syntax validation, edge case handling |
| Agentic Systems | Tool selection, goal achievement, multi-agent coordination |
Test your Python functions directly with the @endpoint decorator:
```python
from rhesis.sdk.decorators import endpoint

@endpoint(name="my-chatbot")
def chat(message: str) -> str:
    # Your LLM logic here
    return response
```

Features: Zero configuration, automatic parameter binding, auto-reconnection, environment management (dev/staging/production).
Generate tests programmatically:
```python
from rhesis.sdk.synthesizers import PromptSynthesizer

synthesizer = PromptSynthesizer(
    prompt="Generate tests for a medical chatbot that must never provide diagnosis"
)
test_set = synthesizer.generate(num_tests=10)
```

| Option | Best For | Setup Time |
|---|---|---|
| Rhesis Cloud | Teams wanting managed deployment | Instant |
| Docker | Local development and testing | 5 minutes |
| Kubernetes | Production self-hosting | See docs |
Option 1: Cloud (fastest) - app.rhesis.ai - Managed service, just connect your app
Option 2: Self-host with Docker
```bash
git clone https://github.com/rhesis-ai/rhesis.git && cd rhesis && ./rh start
```

./rh start pulls prebuilt images from GHCR. To build images from the repo instead, use ./rh start --build (and ./rh restart --build after local Dockerfile changes).
Access: Frontend at localhost:3000, API at localhost:8080/docs
Commands: ./rh logs · ./rh stop · ./rh restart · ./rh delete
Note: This setup enables auto-login for local testing. For production, see Self-hosting Documentation.
Option 3: Python SDK
```bash
pip install rhesis-sdk
```

Rhesis integrates with your LLM stack across four layers, each addressing a different concern:
| Layer | What it covers |
|---|---|
| LLM providers | The model that runs your test generation and LLM-as-Judge evaluation |
| Tracing | Streaming spans from your application to Rhesis over OpenTelemetry |
| Test execution | Letting Rhesis invoke entry points in your application remotely to run test cases |
| REST API | Programmatic access to test sets, runs, and platform resources |
Choose any provider for the LLMs that drive test synthesis and LLM-as-Judge evaluation. Provider routing is powered by LiteLLM, giving you a single interface to 100+ models.
| Integration | Languages | Description |
|---|---|---|
| OpenAI | Python | OpenAI supported models and embeddings. |
| Anthropic | Python | Native support for Claude models. |
| Google Gemini | Python | Native integration for Google's Gemini models. |
| Vertex AI | Python | Google Cloud Vertex AI model support. |
| Ollama | Python | Local LLM deployment with Ollama integration. |
| OpenRouter | Python | Access to multiple LLM providers through OpenRouter. |
| HuggingFace | Python | Direct integration with HuggingFace models. |
| LiteLLM | Python | Unified interface for 100+ LLMs (OpenAI, Azure, Anthropic, Cohere, Ollama, vLLM, HuggingFace, Replicate). |
Your application emits spans through the Rhesis SDK; spans are batched and sent to Rhesis over HTTP using OpenTelemetry span conventions. The integration mechanism depends on the framework — auto-instrumented frameworks need no code changes, while others use the @observe.llm decorator to mark the boundaries you want traced.
| Integration | Languages | Mechanism | Description |
|---|---|---|---|
| Rhesis SDK | Python, JS/TS | Decorators | Native SDK with @observe.llm and convenience variants (@observe.tool, @observe.retrieval, @observe.embedding, …). Wrap any function you want traced. |
| LangChain | Python | ✅ Automatic | Add the Rhesis callback handler once and every chain step, tool call, and LLM call is traced automatically — no per-function decorators required. |
| LangGraph | Python | ✅ Automatic | Built-in integration for LangGraph agent workflows with full observability — every node transition, tool invocation, and graph step is captured automatically. |
| OpenTelemetry / OpenInference | Python | ✅ Automatic via OTel | Any framework with an OpenInference instrumentor (LlamaIndex, CrewAI, OpenAI Agents SDK, Google ADK, Pydantic AI, DSPy, Haystack, Semantic Kernel) exports to Rhesis through the SDK's OTel-based exporter. See Tracing setup docs for exact endpoint and header configuration. |
| AutoGen, OpenAI Agents SDK, LlamaIndex, CrewAI, and others | Python | Decorators | Wrap the functions, tools, or agents you want to trace with @observe.llm. Without decorators, only top-level inputs and outputs are captured. |
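
To make the decorator mechanism concrete, here is a minimal stand-in showing what "marking a boundary" means: each decorated call produces one span with a name, a kind, a status, and a duration. This is not the Rhesis SDK implementation. The `traced` decorator, the `SPANS` buffer, and the span fields are assumptions for illustration, and a real exporter would batch the spans and send them over HTTP instead of keeping them in a list.

```python
import functools
import time
from typing import Any, Callable, Dict, List

# Hypothetical in-memory buffer standing in for a batching exporter.
SPANS: List[Dict[str, Any]] = []


def traced(kind: str) -> Callable:
    """Decorator factory that records one span per call of the wrapped function."""

    def decorator(fn: Callable) -> Callable:
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            status = "ok"
            try:
                return fn(*args, **kwargs)
            except Exception:
                status = "error"
                raise
            finally:
                # Runs on both success and failure, so every call leaves a span.
                SPANS.append(
                    {
                        "name": fn.__name__,
                        "kind": kind,
                        "status": status,
                        "duration_s": time.perf_counter() - start,
                    }
                )

        return wrapper

    return decorator


@traced("retrieval")
def lookup(query: str) -> List[str]:
    return [f"doc about {query}"]


@traced("llm")
def answer(query: str) -> str:
    docs = lookup(query)  # the nested call produces its own span
    return f"Answer based on {len(docs)} document(s)"


answer("refund policy")
print([(s["name"], s["kind"], s["status"]) for s in SPANS])
```

Only decorated functions leave spans, which is why an undecorated pipeline shows just its top-level inputs and outputs.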
For Rhesis to run test cases against your application, it needs a way to call your code from outside your environment. The Rhesis SDK provides a persistent outbound WebSocket connection — your application opens it at startup and Rhesis can then invoke registered entry points whenever a test run fires. The connection is outbound from your app, so it works through firewalls and from local laptops without exposing a public URL.
You register an entry point with the @endpoint decorator (see SDK: Code-first testing). When a test run starts, Rhesis sends each test case's input down the WebSocket; your application runs the function locally and sends the output back up the same connection. The same call path serves single-turn test cases and multi-turn conversations driven by Penelope (our multi-turn conversation runner).
Both channels run from the same SDK in the same process: spans flow up the HTTP/OTLP channel; test commands flow down the WebSocket. Production traffic and test traffic produce traces in the same format, so the same evaluation metrics grade both.
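
The register-then-dispatch flow described above can be sketched without a network connection. Everything in this example is an assumption made for illustration: the `register` decorator, the `REGISTRY` dictionary, the `handle_message` function, and the JSON message shape (`id`, `endpoint`, `input`) are hypothetical and are not the Rhesis wire protocol. In the real SDK the messages would arrive over the outbound WebSocket instead of being passed in as strings.

```python
import json
from typing import Callable, Dict

# Hypothetical registry mapping entry-point names to local functions.
REGISTRY: Dict[str, Callable[..., str]] = {}


def register(name: str) -> Callable:
    """Decorator that registers a function under a name without changing it."""

    def decorator(fn: Callable[..., str]) -> Callable[..., str]:
        REGISTRY[name] = fn
        return fn

    return decorator


def handle_message(raw: str) -> str:
    """Handle one incoming test command and return the JSON reply to send back."""
    msg = json.loads(raw)
    fn = REGISTRY.get(msg["endpoint"])
    if fn is None:
        return json.dumps(
            {"id": msg["id"], "error": f"unknown endpoint: {msg['endpoint']}"}
        )
    # Bind the test case's input fields to the function's parameters by name.
    output = fn(**msg["input"])
    return json.dumps({"id": msg["id"], "output": output})


@register("my-chatbot")
def chat(message: str) -> str:
    return f"echo: {message}"


reply = handle_message(
    json.dumps({"id": "t1", "endpoint": "my-chatbot", "input": {"message": "hi"}})
)
print(reply)
```

Because the function runs locally and only the input and output cross the connection, the application never needs to accept inbound traffic.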
Direct API access for custom integrations and CI/CD pipelines: manage test sets, trigger test runs, fetch results, and inspect traces programmatically. Language-agnostic — call from Python, TypeScript, Go, shell scripts, or anywhere else. OpenAPI spec available.
See Integration Docs for setup instructions.
MIT licensed. No plans to relicense core features. The enterprise version will live in ee/ folders and remain separate.
We built Rhesis because existing LLM testing tools didn't meet our needs. If you face the same challenges, contributions are welcome.
See CONTRIBUTING.md for guidelines.
Ways to contribute: Fix bugs or add features · Contribute test sets for common failure modes · Improve documentation · Help others in Discord or GitHub discussions
- Documentation - Guides and API reference
- Discord - Community support
- GitHub Issues - Bug reports and feature requests
We take data security seriously. See our Privacy Policy for details.
Telemetry: Rhesis collects basic, anonymized usage statistics to improve the product. No sensitive data is collected or shared with third parties.
- Self-hosted: Opt out by setting OTEL_RHESIS_TELEMETRY_ENABLED=false
- Cloud: Telemetry enabled as part of Terms & Conditions