Rhesis: Collaborative Testing for LLM & Agentic Applications

Website · Docs · Discord · Changelog

More than just evals.
Collaborative agent testing for teams.

Generate tests from requirements, simulate conversation flows, detect adversarial behaviors, evaluate with 60+ metrics, and trace failures with OpenTelemetry. Engineers and domain experts, working together.

Core features

Test generation

AI-Powered Synthesis - Describe requirements in plain language. Rhesis generates hundreds of test scenarios including edge cases and adversarial prompts.

Knowledge-Aware - Connect context sources via file upload or MCP (Notion, GitHub, Jira, Confluence) for better test generation.

Single-turn & conversation simulation

Single-turn for Q&A validation. Conversation simulation for dialogue flows.

Penelope Agent simulates realistic conversations to test context retention, role adherence, and dialogue coherence across extended interactions.

Adversarial testing (red-teaming)

Polyphemus Agent proactively finds vulnerabilities:

Jailbreak attempts and prompt injection
PII leakage and data extraction
Harmful content generation
Role violation and instruction bypassing

Garak Integration - Built-in support for garak, the LLM vulnerability scanner, for comprehensive security testing.

60+ pre-built metrics

Framework	Example Metrics
RAGAS	Context relevance, faithfulness, answer accuracy
DeepEval	Bias, toxicity, PII leakage, role violation, turn relevancy, knowledge retention
Garak	Jailbreak detection, prompt injection, XSS, malware generation, data leakage
Custom	NumericJudge, CategoricalJudge for domain-specific evaluation

All metrics include LLM-as-Judge reasoning explanations.

Traces & observability

Monitor your LLM applications with OpenTelemetry-based tracing:

from rhesis.sdk.decorators import observe

@observe.llm(model="gpt-4")
def generate_response(prompt: str) -> str:
    # Your LLM call here
    return response

Track LLM calls, latency, token usage, and link traces to test results for debugging.

Bring your own model

Use any LLM provider for test generation and evaluation. Provider routing is powered by LiteLLM under the hood, giving you a single interface to 100+ models:

Cloud: OpenAI, Anthropic, Google Gemini, Mistral, Cohere, Groq, Together AI

Local/Self-hosted: Ollama, vLLM, LiteLLM

See Model Configuration Docs for setup instructions.

Why Rhesis?

Platform for teams. SDK for developers.

Use the collaborative platform for team-based testing: product managers define requirements, domain experts review results, engineers integrate via CI/CD. Or integrate directly with the Python SDK for code-first workflows.

The testing lifecycle

Six integrated phases from project setup to team collaboration:

Phase	What You Do
1. Projects	Configure your AI application, upload & connect context sources (files, docs), set up SDK connectors
2. Requirements	Define expected behaviors (what your app should and shouldn't do), cover all relevant aspects from product, marketing, customer support, legal and compliance teams
3. Metrics	Select from 60+ pre-built metrics or create custom LLM-as-Judge evaluations to assess whether your requirements are met
4. Tests	Generate single-turn and conversation simulation test scenarios. Organize in test sets and understand your test coverage
5. Execution	Run tests via UI, SDK, or API; integrate into CI/CD pipelines; collect traces during execution
6. Collaboration	Review results with your team through comments, tasks, workflows, and side-by-side comparisons

Rhesis vs...

Instead of...	Rhesis gives you...
Manual testing	AI-generated test cases based on your context, hundreds in minutes
Traditional test frameworks	Non-deterministic output handling built-in
LLM observability tools	Pre-production validation, not post-production monitoring
Red-teaming services	Continuous, self-service adversarial testing, not one-time audits

What you can test

Use Case	What Rhesis Tests
Conversational AI	Conversation simulation, role adherence, knowledge retention
RAG Systems	Context relevance, faithfulness, hallucination detection
NL-to-SQL / NL-to-Code	Query accuracy, syntax validation, edge case handling
Agentic Systems	Tool selection, goal achievement, multi-agent coordination

SDK: Code-first testing

Test your Python functions directly with the @endpoint decorator:

from rhesis.sdk.decorators import endpoint

@endpoint(name="my-chatbot")
def chat(message: str) -> str:
    # Your LLM logic here
    return response

Features: Zero configuration, automatic parameter binding, auto-reconnection, environment management (dev/staging/production).

Generate tests programmatically:

from rhesis.sdk.synthesizers import PromptSynthesizer

synthesizer = PromptSynthesizer(
    prompt="Generate tests for a medical chatbot that must never provide diagnosis"
)
test_set = synthesizer.generate(num_tests=10)

Deployment options

Option	Best For	Setup Time
Rhesis Cloud	Teams wanting managed deployment	Instant
Docker	Local development and testing	5 minutes
Kubernetes	Production self-hosting	See docs

Quick Start

Option 1: Cloud (fastest) - app.rhesis.ai - Managed service, just connect your app

Option 2: Self-host with Docker

git clone https://github.com/rhesis-ai/rhesis.git && cd rhesis && ./rh start

./rh start pulls prebuilt images from GHCR. To build images from the repo instead, use ./rh start --build (and ./rh restart --build after local Dockerfile changes).

Access: Frontend at localhost:3000, API at localhost:8080/docs

Commands: ./rh logs · ./rh stop · ./rh restart · ./rh delete

Note: This setup enables auto-login for local testing. For production, see Self-hosting Documentation.

Option 3: Python SDK

pip install rhesis-sdk

Integrations

Rhesis integrates with your LLM stack across four layers, each addressing a different concern:

Layer	What it covers
LLM providers	The model that runs your test generation and LLM-as-Judge evaluation
Tracing	Streaming spans from your application to Rhesis over OpenTelemetry
Test execution	Letting Rhesis invoke entry points in your application remotely to run test cases
REST API	Programmatic access to test sets, runs, and platform resources

LLM providers (test generation & judges)

Choose any provider for the LLMs that drive test synthesis and LLM-as-Judge evaluation. Provider routing is powered by LiteLLM, giving you a single interface to 100+ models.

Integration	Languages	Description
OpenAI	Python	OpenAI supported models and embeddings.
Anthropic	Python	Native support for Claude models.
Google Gemini	Python	Native integration for Google's Gemini models.
Vertex AI	Python	Google Cloud Vertex AI model support.
Ollama	Python	Local LLM deployment with Ollama integration.
OpenRouter	Python	Access to multiple LLM providers through OpenRouter.
HuggingFace	Python	Direct integration with HuggingFace models.
LiteLLM	Python	Unified interface for 100+ LLMs (OpenAI, Azure, Anthropic, Cohere, Ollama, vLLM, HuggingFace, Replicate).

Tracing your application

Your application emits spans through the Rhesis SDK; spans are batched and sent to Rhesis over HTTP using OpenTelemetry span conventions. The integration mechanism depends on the framework — auto-instrumented frameworks need no code changes, while others use the @observe.llm decorator to mark the boundaries you want traced.

Integration	Languages	Mechanism	Description
Rhesis SDK	Python, JS/TS	Decorators	Native SDK with `@observe.llm` and convenience variants (`@observe.tool`, `@observe.retrieval`, `@observe.embedding`, …). Wrap any function you want traced.
LangChain	Python	✅ Automatic	Add the Rhesis callback handler once and every chain step, tool call, and LLM call is traced automatically — no per-function decorators required.
LangGraph	Python	✅ Automatic	Built-in integration for LangGraph agent workflows with full observability — every node transition, tool invocation, and graph step is captured automatically.
OpenTelemetry / OpenInference	Python	✅ Automatic via OTel	Any framework with an OpenInference instrumentor (LlamaIndex, CrewAI, OpenAI Agents SDK, Google ADK, Pydantic AI, DSPy, Haystack, Semantic Kernel) exports to Rhesis through the SDK's OTel-based exporter. See Tracing setup docs for exact endpoint and header configuration.
AutoGen, OpenAI Agents SDK, LlamaIndex, CrewAI, and others	Python	Decorators	Wrap the functions, tools, or agents you want to trace with `@observe.llm`. Without decorators, only top-level inputs and outputs are captured.

Test execution: the connector

For Rhesis to run test cases against your application, it needs a way to call your code from outside your environment. The Rhesis SDK provides a persistent outbound WebSocket connection — your application opens it at startup and Rhesis can then invoke registered entry points whenever a test run fires. The connection is outbound from your app, so it works through firewalls and from local laptops without exposing a public URL.

You register an entry point with the @endpoint decorator (see SDK: Code-first testing). When a test run starts, Rhesis sends each test case's input down the WebSocket; your application runs the function locally and sends the output back up the same connection. The same call path serves single-turn test cases and multi-turn conversations driven by Penelope (our multi-turn conversation runner).

Both channels run from the same SDK in the same process: spans flow up the HTTP/OTLP channel; test commands flow down the WebSocket. Production traffic and test traffic produce traces in the same format, so the same evaluation metrics grade both.

REST API

Direct API access for custom integrations and CI/CD pipelines: manage test sets, trigger test runs, fetch results, and inspect traces programmatically. Language-agnostic — call from Python, TypeScript, Go, shell scripts, or anywhere else. OpenAPI spec available.

See Integration Docs for setup instructions.

Open source

MIT licensed. No plans to relicense core features. Enterprise version will live in ee/ folders and remain separate.

We built Rhesis because existing LLM testing tools didn't meet our needs. If you face the same challenges, contributions are welcome.

Contributing

See CONTRIBUTING.md for guidelines.

Ways to contribute: Fix bugs or add features · Contribute test sets for common failure modes · Improve documentation · Help others in Discord or GitHub discussions

Support

Documentation - Guides and API reference
Discord - Community support
GitHub Issues - Bug reports and feature requests

Security & privacy

We take data security seriously. See our Privacy Policy for details.

Telemetry: Rhesis collects basic, anonymized usage statistics to improve the product. No sensitive data is collected or shared with third parties.

Self-hosted: Opt out by setting OTEL_RHESIS_TELEMETRY_ENABLED=false
Cloud: Telemetry enabled as part of Terms & Conditions

Made with in Potsdam, Germany 🇩🇪

Learn more at rhesis.ai

Name		Name	Last commit message	Last commit date
Latest commit History 4,133 Commits
.claude-plugin		.claude-plugin
.claude		.claude
.cursor		.cursor
.github		.github
.vscode		.vscode
agents/research-assistant		agents/research-assistant
apps		apps
charts/rhesis		charts/rhesis
docs		docs
ee		ee
examples		examples
infrastructure		infrastructure
kubernetes		kubernetes
packages/rhesis		packages/rhesis
penelope		penelope
scripts		scripts
sdk		sdk
skills/rhesis		skills/rhesis
terraform/infrastructure		terraform/infrastructure
tests		tests
.codecov.yml		.codecov.yml
.dockerignore		.dockerignore
.env.example		.env.example
.gitignore		.gitignore
.pre-commit-config.yaml		.pre-commit-config.yaml
.readthedocs.yaml		.readthedocs.yaml
CHANGELOG.md		CHANGELOG.md
CLAUDE.md		CLAUDE.md
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
NOTICE		NOTICE
README.md		README.md
RELEASING.md		RELEASING.md
VERSION		VERSION
docker-compose.ghcr.yml		docker-compose.ghcr.yml
docker-compose.yml		docker-compose.yml
playground		playground
release_config.json		release_config.json
rh		rh
ruff.toml		ruff.toml
simulations		simulations

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Rhesis: Collaborative Testing for LLM & Agentic Applications

More than just evals.
Collaborative agent testing for teams.

Core features

Test generation

Single-turn & conversation simulation

Adversarial testing (red-teaming)

60+ pre-built metrics

Traces & observability

Bring your own model

Why Rhesis?

The testing lifecycle

Rhesis vs...

What you can test

SDK: Code-first testing

Deployment options

Quick Start

Integrations

LLM providers (test generation & judges)

Tracing your application

Test execution: the connector

REST API

Open source

Contributing

Support

Security & privacy

About

Uh oh!

Releases 126

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Rhesis: Collaborative Testing for LLM & Agentic Applications

More than just evals.Collaborative agent testing for teams.

Core features

Test generation

Single-turn & conversation simulation

Adversarial testing (red-teaming)

60+ pre-built metrics

Traces & observability

Bring your own model

Why Rhesis?

The testing lifecycle

Rhesis vs...

What you can test

SDK: Code-first testing

Deployment options

Quick Start

Integrations

LLM providers (test generation & judges)

Tracing your application

Test execution: the connector

REST API

Open source

Contributing

Support

Security & privacy

About

Topics

Resources

License

Contributing

Uh oh!

Stars

Watchers

Forks

Releases 126

Uh oh!

Contributors

Uh oh!

Languages

More than just evals.
Collaborative agent testing for teams.