rhesis-ai/rhesis

Rhesis: Collaborative Testing for LLM & Agentic Applications

Website · Docs · Discord · Changelog

More than just evals.
Collaborative agent testing for teams.

Generate tests from requirements, simulate conversation flows, detect adversarial behaviors, evaluate with 60+ metrics, and trace failures with OpenTelemetry. Engineers and domain experts, working together.

Core features

Test generation

AI-Powered Synthesis - Describe requirements in plain language. Rhesis generates hundreds of test scenarios including edge cases and adversarial prompts.

Knowledge-Aware - Connect context sources via file upload or MCP (Notion, GitHub, Jira, Confluence) for better test generation.

Single-turn & conversation simulation

Use single-turn tests for Q&A validation and conversation simulation for dialogue flows.

Penelope Agent simulates realistic conversations to test context retention, role adherence, and dialogue coherence across extended interactions.
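
As a rough illustration of what a context-retention check looks like, the sketch below drives a toy chatbot through a scripted conversation and checks that a fact from the first turn survives to the last. It is not how Penelope works internally; `ToyBot`, `run_conversation`, and the scripted turns are hypothetical stand-ins.

```python
from dataclasses import dataclass, field


@dataclass
class ToyBot:
    """Hypothetical stand-in for the application under test."""
    memory: dict = field(default_factory=dict)

    def reply(self, message: str) -> str:
        # Remember the user's name when it is introduced.
        if message.lower().startswith("my name is "):
            self.memory["name"] = message[len("my name is "):].strip(". ")
            return "Nice to meet you."
        if "what is my name" in message.lower():
            return f"Your name is {self.memory.get('name', 'unknown')}."
        return "Tell me more."


def run_conversation(bot: ToyBot, turns: list[str]) -> list[tuple[str, str]]:
    """Send each scripted user turn to the bot and record the transcript."""
    return [(turn, bot.reply(turn)) for turn in turns]


turns = ["My name is Ada.", "I need help with my order.", "What is my name?"]
transcript = run_conversation(ToyBot(), turns)

# Context retention: the fact from turn 1 must appear in the final answer.
retained = "Ada" in transcript[-1][1]
```

A real simulation would generate the user turns dynamically and grade the transcript with a metric rather than a substring check.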

Adversarial testing (red-teaming)

Polyphemus Agent proactively finds vulnerabilities:

  • Jailbreak attempts and prompt injection
  • PII leakage and data extraction
  • Harmful content generation
  • Role violation and instruction bypassing

Garak Integration - Built-in support for garak, the LLM vulnerability scanner, for comprehensive security testing.

60+ pre-built metrics

| Framework | Example Metrics |
|---|---|
| RAGAS | Context relevance, faithfulness, answer accuracy |
| DeepEval | Bias, toxicity, PII leakage, role violation, turn relevancy, knowledge retention |
| Garak | Jailbreak detection, prompt injection, XSS, malware generation, data leakage |
| Custom | NumericJudge, CategoricalJudge for domain-specific evaluation |

All metrics include LLM-as-Judge reasoning explanations.
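
To make the idea concrete, here is a minimal sketch of the shape a numeric LLM-as-Judge result might take: a score, a pass threshold, and the judge's reasoning. The `JudgeResult` class, its fields, and the 0-to-10 scale are assumptions for illustration, not the Rhesis SDK's actual types, and the judge model call is replaced with a hard-coded response.

```python
import json
from dataclasses import dataclass


@dataclass
class JudgeResult:
    """Hypothetical result record; not the Rhesis SDK's real type."""
    score: float
    threshold: float
    reasoning: str

    @property
    def passed(self) -> bool:
        return self.score >= self.threshold


def parse_judge_output(raw: str, threshold: float) -> JudgeResult:
    """Parse a judge model's JSON answer into a structured result."""
    data = json.loads(raw)
    return JudgeResult(
        score=float(data["score"]),
        threshold=threshold,
        reasoning=data["reasoning"],
    )


# Stand-in for what a judge model could return for one test case.
raw_output = '{"score": 8, "reasoning": "Answer is grounded in the provided context."}'
result = parse_judge_output(raw_output, threshold=7.0)
```

Keeping the reasoning next to the score lets a reviewer see why a test passed or failed without rerunning the judge.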

Traces & observability

Monitor your LLM applications with OpenTelemetry-based tracing:

```python
from rhesis.sdk.decorators import observe


@observe.llm(model="gpt-4")
def generate_response(prompt: str) -> str:
    # Replace this placeholder with your actual LLM call.
    response = f"(model output for: {prompt})"
    return response
```

Track LLM calls, latency, token usage, and link traces to test results for debugging.

Bring your own model

Use any LLM provider for test generation and evaluation. Provider routing is powered by LiteLLM under the hood, giving you a single interface to 100+ models:

Cloud: OpenAI, Anthropic, Google Gemini, Mistral, Cohere, Groq, Together AI

Local/Self-hosted: Ollama, vLLM, LiteLLM

See Model Configuration Docs for setup instructions.


Why Rhesis?

Platform for teams. SDK for developers.

Use the collaborative platform for team-based testing: product managers define requirements, domain experts review results, engineers integrate via CI/CD. Or integrate directly with the Python SDK for code-first workflows.

The testing lifecycle

Six integrated phases from project setup to team collaboration:

| Phase | What You Do |
|---|---|
| 1. Projects | Configure your AI application, upload & connect context sources (files, docs), set up SDK connectors |
| 2. Requirements | Define expected behaviors (what your app should and shouldn't do), cover all relevant aspects from product, marketing, customer support, legal and compliance teams |
| 3. Metrics | Select from 60+ pre-built metrics or create custom LLM-as-Judge evaluations to assess whether your requirements are met |
| 4. Tests | Generate single-turn and conversation simulation test scenarios. Organize in test sets and understand your test coverage |
| 5. Execution | Run tests via UI, SDK, or API; integrate into CI/CD pipelines; collect traces during execution |
| 6. Collaboration | Review results with your team through comments, tasks, workflows, and side-by-side comparisons |
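
For the CI/CD part of the Execution phase, a pipeline step typically turns test results into an exit code. The sketch below shows one way to do that; `pass_rate`, `gate`, the 90% bar, and the list-of-booleans result format are all assumptions for illustration rather than anything the Rhesis SDK prescribes.

```python
def pass_rate(results: list[bool]) -> float:
    """Fraction of test cases that passed; 0.0 for an empty run."""
    return sum(results) / len(results) if results else 0.0


def gate(results: list[bool], minimum: float = 0.9) -> int:
    """Return a process exit code: 0 if the run meets the bar, 1 otherwise."""
    return 0 if pass_rate(results) >= minimum else 1


# Hypothetical outcome of a test run: 9 passes, 1 failure.
run_results = [True] * 9 + [False]
exit_code = gate(run_results, minimum=0.9)
```

In a pipeline script you would finish with `sys.exit(exit_code)` so that a run below the bar fails the build.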

Rhesis vs...

| Instead of... | Rhesis gives you... |
|---|---|
| Manual testing | AI-generated test cases based on your context, hundreds in minutes |
| Traditional test frameworks | Non-deterministic output handling built-in |
| LLM observability tools | Pre-production validation, not post-production monitoring |
| Red-teaming services | Continuous, self-service adversarial testing, not one-time audits |

What you can test

| Use Case | What Rhesis Tests |
|---|---|
| Conversational AI | Conversation simulation, role adherence, knowledge retention |
| RAG Systems | Context relevance, faithfulness, hallucination detection |
| NL-to-SQL / NL-to-Code | Query accuracy, syntax validation, edge case handling |
| Agentic Systems | Tool selection, goal achievement, multi-agent coordination |
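
To illustrate what "syntax validation" can mean for an NL-to-SQL system, the sketch below checks whether a generated query compiles against a schema without running it. It uses Python's built-in `sqlite3` module and the SQLite dialect as a stand-in; it is not a description of how Rhesis validates queries, and `is_valid_sql` is a hypothetical helper.

```python
import sqlite3


def is_valid_sql(query: str, schema: str) -> bool:
    """Check that a query compiles against a schema without executing it."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(schema)
        # EXPLAIN compiles the statement, so syntax errors and unknown
        # tables or columns surface without running the query itself.
        conn.execute(f"EXPLAIN {query}")
        return True
    except sqlite3.Error:
        return False
    finally:
        conn.close()


schema = "CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL);"
good = is_valid_sql("SELECT id FROM orders WHERE total > 100", schema)
bad = is_valid_sql("SELEC id FROM orders", schema)
missing = is_valid_sql("SELECT price FROM orders", schema)
```

A check like this catches malformed queries and references to columns that do not exist; whether the query answers the user's question still needs a separate accuracy metric.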

SDK: Code-first testing

Test your Python functions directly with the @endpoint decorator:

```python
from rhesis.sdk.decorators import endpoint


@endpoint(name="my-chatbot")
def chat(message: str) -> str:
    # Replace this placeholder with your actual LLM logic.
    response = f"(model output for: {message})"
    return response
```

Features: Zero configuration, automatic parameter binding, auto-reconnection, environment management (dev/staging/production).

Generate tests programmatically:

```python
from rhesis.sdk.synthesizers import PromptSynthesizer

synthesizer = PromptSynthesizer(
    prompt="Generate tests for a medical chatbot that must never provide diagnosis"
)
test_set = synthesizer.generate(num_tests=10)
```

Deployment options

| Option | Best For | Setup Time |
|---|---|---|
| Rhesis Cloud | Teams wanting managed deployment | Instant |
| Docker | Local development and testing | 5 minutes |
| Kubernetes | Production self-hosting | See docs |

Quick Start

Option 1: Cloud (fastest) - app.rhesis.ai - Managed service, just connect your app

Option 2: Self-host with Docker

```shell
git clone https://github.com/rhesis-ai/rhesis.git && cd rhesis && ./rh start
```

./rh start pulls prebuilt images from GHCR. To build images from the repo instead, use ./rh start --build (and ./rh restart --build after local Dockerfile changes).

Access: Frontend at localhost:3000, API at localhost:8080/docs

Commands: ./rh logs · ./rh stop · ./rh restart · ./rh delete

Note: This setup enables auto-login for local testing. For production, see Self-hosting Documentation.

Option 3: Python SDK

```shell
pip install rhesis-sdk
```

Integrations

Rhesis integrates with your LLM stack across four layers, each addressing a different concern:

| Layer | What it covers |
|---|---|
| LLM providers | The model that runs your test generation and LLM-as-Judge evaluation |
| Tracing | Streaming spans from your application to Rhesis over OpenTelemetry |
| Test execution | Letting Rhesis invoke entry points in your application remotely to run test cases |
| REST API | Programmatic access to test sets, runs, and platform resources |

LLM providers (test generation & judges)

Choose any provider for the LLMs that drive test synthesis and LLM-as-Judge evaluation. Provider routing is powered by LiteLLM, giving you a single interface to 100+ models.

| Integration | Languages | Description |
|---|---|---|
| OpenAI | Python | OpenAI supported models and embeddings. |
| Anthropic | Python | Native support for Claude models. |
| Google Gemini | Python | Native integration for Google's Gemini models. |
| Vertex AI | Python | Google Cloud Vertex AI model support. |
| Ollama | Python | Local LLM deployment with Ollama integration. |
| OpenRouter | Python | Access to multiple LLM providers through OpenRouter. |
| HuggingFace | Python | Direct integration with HuggingFace models. |
| LiteLLM | Python | Unified interface for 100+ LLMs (OpenAI, Azure, Anthropic, Cohere, Ollama, vLLM, HuggingFace, Replicate). |

Tracing your application

Your application emits spans through the Rhesis SDK; spans are batched and sent to Rhesis over HTTP using OpenTelemetry span conventions. The integration mechanism depends on the framework — auto-instrumented frameworks need no code changes, while others use the @observe.llm decorator to mark the boundaries you want traced.

| Integration | Languages | Mechanism | Description |
|---|---|---|---|
| Rhesis SDK | Python, JS/TS | Decorators | Native SDK with @observe.llm and convenience variants (@observe.tool, @observe.retrieval, @observe.embedding, …). Wrap any function you want traced. |
| LangChain | Python | ✅ Automatic | Add the Rhesis callback handler once and every chain step, tool call, and LLM call is traced automatically, with no per-function decorators required. |
| LangGraph | Python | ✅ Automatic | Built-in integration for LangGraph agent workflows with full observability: every node transition, tool invocation, and graph step is captured automatically. |
| OpenTelemetry / OpenInference | Python | ✅ Automatic via OTel | Any framework with an OpenInference instrumentor (LlamaIndex, CrewAI, OpenAI Agents SDK, Google ADK, Pydantic AI, DSPy, Haystack, Semantic Kernel) exports to Rhesis through the SDK's OTel-based exporter. See Tracing setup docs for exact endpoint and header configuration. |
| AutoGen, OpenAI Agents SDK, LlamaIndex, CrewAI, and others | Python | Decorators | Wrap the functions, tools, or agents you want to trace with @observe.llm. Without decorators, only top-level inputs and outputs are captured. |
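
The decorator-plus-batching pattern described in this section can be sketched in plain Python. This is a simplified illustration, not the Rhesis SDK's implementation: the `traced` decorator, the span dictionary layout, and the batch size of 2 are assumptions, and the "exporter" just stores batches in a list instead of sending them over HTTP.

```python
import functools
import time

BATCH_SIZE = 2  # Flush after this many spans (illustrative value).
_pending: list[dict] = []
exported_batches: list[list[dict]] = []


def _flush() -> None:
    """Hand the pending spans to the exporter; here we just store them."""
    if _pending:
        exported_batches.append(list(_pending))
        _pending.clear()


def traced(name: str):
    """Hypothetical decorator that records one span per call."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                # Record the span even if the wrapped function raised.
                _pending.append({
                    "name": name,
                    "duration_s": time.perf_counter() - start,
                })
                if len(_pending) >= BATCH_SIZE:
                    _flush()
        return wrapper
    return decorator


@traced("retrieve")
def retrieve(query: str) -> list[str]:
    return [f"doc about {query}"]


@traced("generate")
def generate(query: str) -> str:
    return f"answer using {retrieve(query)[0]}"


answer = generate("refunds")
```

The inner `retrieve` span finishes first, so it is recorded before the outer `generate` span, and both leave the process in a single batch.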

Test execution: the connector

For Rhesis to run test cases against your application, it needs a way to call your code from outside your environment. The Rhesis SDK provides a persistent outbound WebSocket connection — your application opens it at startup and Rhesis can then invoke registered entry points whenever a test run fires. The connection is outbound from your app, so it works through firewalls and from local laptops without exposing a public URL.

You register an entry point with the @endpoint decorator (see SDK: Code-first testing). When a test run starts, Rhesis sends each test case's input down the WebSocket; your application runs the function locally and sends the output back up the same connection. The same call path serves single-turn test cases and multi-turn conversations driven by Penelope (our multi-turn conversation runner).

Both channels run from the same SDK in the same process: spans flow up the HTTP/OTLP channel; test commands flow down the WebSocket. Production traffic and test traffic produce traces in the same format, so the same evaluation metrics grade both.
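
The request/response flow over the connector can be pictured as a registry of entry points plus a message dispatcher. The sketch below is an assumption-laden illustration: the `register` decorator, the JSON message fields (`id`, `endpoint`, `input`, `output`, `error`), and `handle_message` are hypothetical and do not reflect the actual Rhesis wire protocol. The WebSocket itself is omitted so the example runs on its own.

```python
import json
from typing import Callable

# Registry of callable entry points, keyed by name.
ENTRY_POINTS: dict[str, Callable[..., str]] = {}


def register(name: str):
    """Hypothetical registration decorator, loosely analogous to @endpoint."""
    def decorator(func):
        ENTRY_POINTS[name] = func
        return func
    return decorator


def handle_message(raw: str) -> str:
    """Dispatch one incoming test command and build the reply message."""
    msg = json.loads(raw)
    func = ENTRY_POINTS.get(msg["endpoint"])
    if func is None:
        return json.dumps({"id": msg["id"], "error": "unknown endpoint"})
    try:
        output = func(**msg["input"])
        return json.dumps({"id": msg["id"], "output": output})
    except Exception as exc:  # Report failures instead of dropping the connection.
        return json.dumps({"id": msg["id"], "error": str(exc)})


@register("my-chatbot")
def chat(message: str) -> str:
    return f"echo: {message}"


# In a real connector this string would arrive over the WebSocket.
incoming = json.dumps({"id": "t1", "endpoint": "my-chatbot", "input": {"message": "hi"}})
reply = json.loads(handle_message(incoming))
```

Because the function runs inside your own process, the test exercises the same code path as production traffic, and only the input and output cross the connection.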

REST API

Direct API access for custom integrations and CI/CD pipelines: manage test sets, trigger test runs, fetch results, and inspect traces programmatically. Language-agnostic — call from Python, TypeScript, Go, shell scripts, or anywhere else. OpenAPI spec available.
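
As a sketch of what a scripted call might look like, the example below builds (but does not send) an authenticated JSON request using only the Python standard library. The URL path, the payload field, and the bearer-token scheme are placeholders and assumptions; take the real routes and authentication details from the OpenAPI spec.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080"  # Local API address from the Quick Start.
API_TOKEN = "YOUR_API_TOKEN"        # Placeholder; the auth scheme is an assumption.


def build_run_request(test_set_id: str) -> urllib.request.Request:
    """Build (but do not send) a request that would trigger a test run."""
    payload = json.dumps({"test_set_id": test_set_id}).encode("utf-8")
    return urllib.request.Request(
        url=f"{BASE_URL}/example/test-runs",  # Hypothetical path.
        data=payload,
        method="POST",
        headers={
            "Authorization": f"Bearer {API_TOKEN}",
            "Content-Type": "application/json",
        },
    )


request = build_run_request("ts_123")
```

Once the path and payload match the published spec, `urllib.request.urlopen(request)` would send it.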

See Integration Docs for setup instructions.


Open source

MIT licensed. No plans to relicense core features. The enterprise version will live in ee/ folders and remain separate.

We built Rhesis because existing LLM testing tools didn't meet our needs. If you face the same challenges, contributions are welcome.


Contributing

See CONTRIBUTING.md for guidelines.

Ways to contribute: Fix bugs or add features · Contribute test sets for common failure modes · Improve documentation · Help others in Discord or GitHub discussions


Security & privacy

We take data security seriously. See our Privacy Policy for details.

Telemetry: Rhesis collects basic, anonymized usage statistics to improve the product. No sensitive data is collected or shared with third parties.

  • Self-hosted: Opt out by setting OTEL_RHESIS_TELEMETRY_ENABLED=false
  • Cloud: Telemetry enabled as part of Terms & Conditions

Made with Rhesis in Potsdam, Germany 🇩🇪

Learn more at rhesis.ai

About

The testing platform for AI teams. Bring engineers, PMs, and domain experts together to generate tests, simulate (adversarial) conversations, and trace every failure to its root cause.
