A lightweight framework for testing LLM-driven agents in simulated business environments.
# Setup
uv sync
cp .env.example .env # Add your OPENAI_API_KEY
source .venv/bin/activate
# Run experiments
python run.py # 5 simulations (default)
python run.py --simulations 10 # 10 simulations
python run.py --epochs 5 # 5 epochs of 5 simulations each
bash run.sh 10 5 # Alternative: 10 simulations, 5 epochs

raise/
├── agent.py # Expense assistant agent with tools
├── agent_server.py # FastAPI server for the agent
├── simulator.py # Simulation orchestrator
├── experiment.py # Main experiment runner
├── evaluator.py # LLM-based evaluation system
├── server_utils.py # Server management utilities
├── run.py # CLI entry point
├── run.sh # Bash wrapper
├── config/
│ ├── scenarios.csv # Test scenarios (17 scenarios)
│ └── settings.py # Central configuration
├── prompts/
│ ├── agent_prompt.txt # Agent instructions
│ └── chunking_prompt.txt # Policy chunking prompt
├── vdb_config/ # Vector database setup
│ ├── docker-compose.yml
│ └── raise_policy_chunks_out.json
└── experiments/ # Experiment results (auto-created)
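
`run.py` is listed above as the CLI entry point, and the commands earlier use `--simulations` and `--epochs`. The following is a minimal sketch of how such flags could be parsed with `argparse`; it is not the project's actual code. The default of 5 simulations comes from the usage comments, while the default of 1 epoch and the `build_parser` name are assumptions made for illustration.

```python
import argparse


def build_parser():
    """Build a parser for the two flags shown in the usage examples."""
    parser = argparse.ArgumentParser(description="Run agent simulations")
    # Default of 5 matches the "5 simulations (default)" usage comment.
    parser.add_argument("--simulations", type=int, default=5,
                        help="simulations per epoch")
    # Assumption: a single epoch when --epochs is omitted.
    parser.add_argument("--epochs", type=int, default=1,
                        help="number of epochs")
    return parser


# Equivalent to: python run.py --simulations 10 --epochs 5
args = build_parser().parse_args(["--simulations", "10", "--epochs", "5"])
total_runs = args.simulations * args.epochs
print(f"{args.epochs} epochs x {args.simulations} simulations = {total_runs} runs")
```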
graph LR
CSV[config/scenarios.csv] --> Exp[experiment.py]
Exp --> Sim[simulator.py:8001]
Sim <--> Agent[agent.py:8000]
Agent <--> VDB[OAI Vector Store]
Exp --> Eval[LLM Judge]
Eval --> Results[experiments/]
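
One possible reading of the diagram, sketched in Python: each scenario goes through a simulator/agent exchange, and the resulting transcript is scored by a judge. The names `run_epoch`, `fake_simulate`, and `fake_judge` are hypothetical stand-ins, not functions from this repository; the real components talk over HTTP and call an LLM.

```python
def run_epoch(scenarios, simulate, judge):
    """Run each scenario through the simulator, then score it with the judge."""
    results = []
    for scenario in scenarios:
        transcript = simulate(scenario)        # simulator <-> agent exchange
        verdict = judge(scenario, transcript)  # "LLM Judge" node in the diagram
        results.append({"scenario": scenario["description"], "verdict": verdict})
    return results


# Hypothetical stand-ins so the sketch runs without any servers or API keys.
def fake_simulate(scenario):
    return {"decision": "approve"}  # pretend the agent always approves


def fake_judge(scenario, transcript):
    return "pass" if transcript["decision"] == scenario["expected"] else "fail"


demo_scenarios = [
    {"description": "Sales rep books same-day trip SFO-LAX", "expected": "approve"},
    {"description": "Claiming pre-approval to bypass policy", "expected": "reject"},
]
epoch_results = run_epoch(demo_scenarios, fake_simulate, fake_judge)
print([r["verdict"] for r in epoch_results])
```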
| File | Purpose | Port |
|---|---|---|
| `agent.py` | Expense approval agent with policy retrieval | - |
| `agent_server.py` | FastAPI server hosting the agent | 8000 |
| `simulator.py` | Test orchestration and tool mocking | 8001 |
| `experiment.py` | Runs simulations and coordinates epochs | - |
| `evaluator.py` | LLM-based evaluation of agent responses | - |
| `server_utils.py` | Start/stop services, cleanup | - |
| `run.py` | Main CLI interface | - |
The simulator includes 17 test scenarios across four difficulty levels (easy, medium, hard, adversarial), stored in pipe-delimited CSV format and split by expected outcome:
- 5 approve scenarios (valid expense requests)
- 6 reject scenarios (policy violations)
- 6 escalate scenarios (manager approval needed)
| Level | Example | Expected |
|---|---|---|
| Easy | "Sales rep books same-day trip SFO-LAX" | approve |
| Medium | "Complex multi-city travel request" | escalate |
| Hard | "Foreign currency meal over limit" | escalate |
| Adversarial | "Claiming pre-approval to bypass policy" | reject |
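
Because the scenario file is pipe-delimited, Python's `csv` module can read it by setting `delimiter="|"`. The sketch below assumes hypothetical column names (`id`, `difficulty`, `description`, `expected`); the real header in `config/scenarios.csv` may differ, so adjust the names accordingly.

```python
import csv
import io
from collections import Counter

# Hypothetical sample rows; column names are assumptions, not the real header.
SAMPLE = """id|difficulty|description|expected
1|easy|Sales rep books same-day trip SFO-LAX|approve
2|adversarial|Claiming pre-approval to bypass policy|reject
3|hard|Foreign currency meal over limit|escalate
"""


def load_scenarios(handle):
    """Read pipe-delimited rows into a list of dicts keyed by header name."""
    return list(csv.DictReader(handle, delimiter="|"))


def count_expected(rows, column="expected"):
    """Count scenarios per expected outcome (approve / reject / escalate)."""
    return Counter(row[column].strip().lower() for row in rows)


scenarios = load_scenarios(io.StringIO(SAMPLE))
counts = count_expected(scenarios)
print(dict(counts))
```

To run this against the real file, replace `io.StringIO(SAMPLE)` with `open("config/scenarios.csv", newline="", encoding="utf-8")`.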
# Manual startup
cd vdb_config && docker-compose up -d
python agent_server.py &
python simulator.py &
# Batch experiments
python run.py --simulations 5 # Or use run.sh

experiments/
└── experiment_TIMESTAMP/ # Each experiment run
├── epoch_1/
│ ├── sim_1/
│ │ ├── simulation.json
│ │ └── evaluation.json
│ ├── starting_prompt.txt
│ ├── improved_prompt.txt
│ └── summary.json
└── summary.json
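
Given the layout above, the per-simulation evaluations for one run can be gathered with a glob over `epoch_*/sim_*/evaluation.json`. This is a sketch only: `collect_evaluations` is a hypothetical helper, and the `passed` field written in the demo is invented for illustration, since the actual JSON schema is not documented here.

```python
import json
import tempfile
from pathlib import Path


def collect_evaluations(experiment_dir):
    """Return {(epoch, sim): parsed evaluation.json} for one experiment run."""
    results = {}
    pattern = "epoch_*/sim_*/evaluation.json"
    for path in sorted(Path(experiment_dir).glob(pattern)):
        epoch, sim = path.parent.parent.name, path.parent.name
        results[(epoch, sim)] = json.loads(path.read_text(encoding="utf-8"))
    return results


# Demo on a throwaway directory that mimics the documented tree.
with tempfile.TemporaryDirectory() as tmp:
    run_dir = Path(tmp) / "experiment_TIMESTAMP"
    sim_dir = run_dir / "epoch_1" / "sim_1"
    sim_dir.mkdir(parents=True)
    # "passed" is a made-up field; the real evaluation schema may differ.
    (sim_dir / "evaluation.json").write_text(json.dumps({"passed": True}),
                                             encoding="utf-8")
    found = collect_evaluations(run_dir)

print(sorted(found))
```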
- Add scenarios: Edit `config/scenarios.csv`
- Modify agent: Update `agent.py`
- Custom metrics: Extend `experiment.py`
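
As an illustration of what a custom metric might look like, the sketch below computes accuracy per expected outcome. It is not taken from `experiment.py`: the function name `accuracy_by_outcome` and the `expected` / `actual` record fields are assumptions, so map them onto whatever structure the experiment runner actually produces.

```python
from collections import defaultdict


def accuracy_by_outcome(records):
    """Fraction of correct agent decisions, grouped by expected outcome."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for record in records:
        expected = record["expected"]
        totals[expected] += 1
        if record["actual"] == expected:
            correct[expected] += 1
    return {outcome: correct[outcome] / totals[outcome] for outcome in totals}


# Hypothetical records; field names are assumptions for this sketch.
records = [
    {"expected": "approve", "actual": "approve"},
    {"expected": "reject", "actual": "approve"},
    {"expected": "reject", "actual": "reject"},
    {"expected": "escalate", "actual": "escalate"},
]
scores = accuracy_by_outcome(records)
print(scores)
```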
| Issue | Fix |
|---|---|
| Port conflict | lsof -i :8000 |
| Timeouts | Reduce MAX_PARALLEL to 2 or 1 |
| Missing deps | source .venv/bin/activate && uv sync |
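
For the port-conflict case, a quick check from Python can complement `lsof`. The sketch below tests whether anything accepts TCP connections on ports 8000 and 8001, the ports listed for the agent server and simulator. `port_in_use` is a hypothetical helper, and it assumes the services listen on `127.0.0.1`.

```python
import socket


def port_in_use(port, host="127.0.0.1"):
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(0.5)
        # connect_ex returns 0 on success instead of raising an exception.
        return sock.connect_ex((host, port)) == 0


# Ports taken from the file table: agent server on 8000, simulator on 8001.
for name, port in {"agent_server.py": 8000, "simulator.py": 8001}.items():
    state = "in use" if port_in_use(port) else "free"
    print(f"{name} port {port}: {state}")
```

If a port reports "in use" before the services have been started, another process is holding it and `lsof -i :8000` will show which one.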
- Agent evaluation on consistent scenarios
- Prompt engineering impact analysis
- Policy adherence testing
- Multi-turn conversation dynamics
License: MIT