veris-ai/RAISE

Agent Testing Simulator

A lightweight framework for testing LLM-driven agents in simulated business environments.

Quick Start

# Setup
uv sync
cp .env.example .env  # Add your OPENAI_API_KEY
source .venv/bin/activate

# Run experiments
python run.py                      # 5 simulations (default)
python run.py --simulations 10     # 10 simulations
python run.py --epochs 5           # 5 epochs of 5 simulations each
bash run.sh 10 5                   # Alternative: 10 simulations, 5 epochs
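
The flags above imply that the total number of runs is epochs times simulations per epoch. A minimal sketch of that arithmetic follows; it is not the actual `run.py`. The default of 5 simulations comes from the comments above, while the default of 1 epoch and the helper names `build_parser` and `total_runs` are assumptions made for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the two flags shown in Quick Start."""
    parser = argparse.ArgumentParser(description="Illustrative sketch, not run.py itself")
    # Default of 5 simulations is stated in the Quick Start comments.
    parser.add_argument("--simulations", type=int, default=5, help="simulations per epoch")
    # Default of 1 epoch is an assumption; the README does not state it.
    parser.add_argument("--epochs", type=int, default=1, help="number of epochs")
    return parser


def total_runs(argv: list[str]) -> int:
    """Return how many simulations a given command line would launch in total."""
    args = build_parser().parse_args(argv)
    return args.simulations * args.epochs


if __name__ == "__main__":
    print(total_runs(["--epochs", "5"]))
```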

Project Structure

raise/
├── agent.py              # Expense assistant agent with tools
├── agent_server.py       # FastAPI server for the agent
├── simulator.py          # Simulation orchestrator
├── experiment.py         # Main experiment runner
├── evaluator.py          # LLM-based evaluation system
├── server_utils.py       # Server management utilities
├── run.py               # CLI entry point
├── run.sh               # Bash wrapper
├── config/
│   ├── scenarios.csv    # Test scenarios (17 scenarios)
│   └── settings.py      # Central configuration
├── prompts/
│   ├── agent_prompt.txt # Agent instructions
│   └── chunking_prompt.txt # Policy chunking prompt
├── vdb_config/          # Vector database setup
│   ├── docker-compose.yml
│   └── raise_policy_chunks_out.json
└── experiments/         # Experiment results (auto-created)

Architecture

graph LR
    CSV[config/scenarios.csv] --> Exp[experiment.py]
    Exp --> Sim[simulator.py:8001]
    Sim <--> Agent[agent.py:8000]
    Agent <--> VDB[OAI Vector Store]
    Exp --> Eval[LLM Judge]
    Eval --> Results[experiments/]

Components

| File | Purpose | Port |
|------|---------|------|
| agent.py | Expense approval agent with policy retrieval | - |
| agent_server.py | FastAPI server hosting the agent | 8000 |
| simulator.py | Test orchestration and tool mocking | 8001 |
| experiment.py | Runs simulations and coordinates epochs | - |
| evaluator.py | LLM-based evaluation of agent responses | - |
| server_utils.py | Start/stop services, cleanup | - |
| run.py | Main CLI interface | - |

Test Scenarios

config/scenarios.csv holds 17 test scenarios in pipe-delimited CSV format. They span several difficulty levels and three expected outcomes:

  • 5 approve scenarios (valid expense requests)
  • 6 reject scenarios (policy violations)
  • 6 escalate scenarios (manager approval needed)

| Level | Example | Expected |
|-------|---------|----------|
| Easy | "Sales rep books same-day trip SFO-LAX" | approve |
| Medium | "Complex multi-city travel request" | escalate |
| Hard | "Foreign currency meal over limit" | escalate |
| Adversarial | "Claiming pre-approval to bypass policy" | reject |
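
Because the scenario file is pipe-delimited, it can be read with the standard `csv` module by passing `delimiter="|"`. The sketch below parses a small inline sample and tallies expected outcomes. The column names `level`, `scenario`, and `expected` are assumptions (the real header in config/scenarios.csv may differ), and the sample rows are taken from the examples in the table above.

```python
import csv
import io
from collections import Counter

# Inline sample in the assumed layout; the real file's header may differ.
SAMPLE = """level|scenario|expected
Easy|Sales rep books same-day trip SFO-LAX|approve
Hard|Foreign currency meal over limit|escalate
Adversarial|Claiming pre-approval to bypass policy|reject
"""


def load_scenarios(text: str) -> list[dict]:
    """Parse pipe-delimited scenario text into a list of row dictionaries."""
    reader = csv.DictReader(io.StringIO(text), delimiter="|")
    return list(reader)


def count_by_expected(rows: list[dict]) -> Counter:
    """Tally scenarios by their expected outcome (approve / reject / escalate)."""
    return Counter(row["expected"] for row in rows)


if __name__ == "__main__":
    print(count_by_expected(load_scenarios(SAMPLE)))
```

To run the same tally on the real file, pass `Path("config/scenarios.csv").read_text()` to `load_scenarios` after adjusting the column name to match its header.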

Running Services

# Manual startup
cd vdb_config && docker-compose up -d
python agent_server.py &
python simulator.py &

# Batch experiments
python run.py --simulations 5  # Or use run.sh

⚠️ Keep MAX_PARALLEL ≤ 3 to avoid SQLite locking and thread pool issues.
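
One common way to enforce such a limit is to cap the worker count of a thread pool. The following is a minimal sketch of that idea, not the project's actual scheduling code: `run_simulation` is a hypothetical placeholder, and the hard cap of 3 simply mirrors the warning above.

```python
from concurrent.futures import ThreadPoolExecutor

MAX_PARALLEL = 3  # value recommended by the warning above


def run_simulation(sim_id: int) -> dict:
    """Hypothetical stand-in for one simulation; real work would go here."""
    return {"sim_id": sim_id, "status": "done"}


def run_batch(n: int, max_parallel: int = MAX_PARALLEL) -> list[dict]:
    """Run n simulations with at most min(max_parallel, 3) running concurrently."""
    workers = max(1, min(max_parallel, 3))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in submission order, not completion order.
        return list(pool.map(run_simulation, range(1, n + 1)))


if __name__ == "__main__":
    print(run_batch(5))
```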

Output Structure

experiments/
└── experiment_TIMESTAMP/          # Each experiment run
    ├── epoch_1/
    │   ├── sim_1/
    │   │   ├── simulation.json
    │   │   └── evaluation.json
    │   ├── starting_prompt.txt
    │   ├── improved_prompt.txt
    │   └── summary.json
    └── summary.json
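
Given this layout, per-simulation evaluations can be gathered with a glob over `epoch_*/sim_*/evaluation.json`. The sketch below assumes only the directory structure shown above; the helper name `collect_evaluations` is hypothetical, and no assumption is made about the fields inside each JSON file.

```python
import json
from pathlib import Path


def collect_evaluations(experiment_dir: str) -> list[dict]:
    """Load every evaluation.json under one experiment_TIMESTAMP directory."""
    root = Path(experiment_dir)
    results = []
    # Paths are sorted lexicographically, so epoch_10 sorts before epoch_2.
    for path in sorted(root.glob("epoch_*/sim_*/evaluation.json")):
        results.append(
            {
                "epoch": path.parent.parent.name,  # e.g. "epoch_1"
                "sim": path.parent.name,           # e.g. "sim_1"
                "evaluation": json.loads(path.read_text()),
            }
        )
    return results
```

Calling `collect_evaluations("experiments/experiment_TIMESTAMP")` with a real run directory returns one entry per simulation, which can then be grouped by epoch.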

Extending

  • Add scenarios: Edit config/scenarios.csv
  • Modify agent: Update agent.py
  • Custom metrics: Extend experiment.py

Troubleshooting

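| Issue | Fix |
|-------|-----|
| Port conflict | Run `lsof -i :8000` (or `:8001`) to find the process holding the port, then stop it |
| Timeouts | Reduce MAX_PARALLEL to 2 or 1 |
| Missing deps | `source .venv/bin/activate && uv sync` |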
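
A port conflict can also be checked from Python before starting the services. This is a small sketch using only the standard `socket` module; the helper name `port_in_use` is hypothetical, and ports 8000 and 8001 are the ones listed in the Components section.

```python
import socket


def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if something is already accepting TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(0.5)
        # connect_ex returns 0 on success instead of raising an exception.
        return sock.connect_ex((host, port)) == 0


if __name__ == "__main__":
    for name, port in (("agent_server.py", 8000), ("simulator.py", 8001)):
        state = "in use" if port_in_use(port) else "free"
        print(f"{name} port {port}: {state}")
```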

Research Applications

  • Agent evaluation on consistent scenarios
  • Prompt engineering impact analysis
  • Policy adherence testing
  • Multi-turn conversation dynamics

License

MIT
