The τ²-bench leaderboard is now live at taubench.com!
- 📊 Interactive Rankings: Compare model performance across all domains
- 📱 Mobile-Friendly: View results on any device
- 🔍 Detailed Analysis: Explore trajectories and conversation flows
- 📥 Easy Submission: Submit your results directly through the interface
→ Visit the Leaderboard | → Submit Your Results
Each domain specifies:
- a policy that the agent must follow
- a set of tools that the agent can use
- a set of tasks to evaluate the agent's performance
- optionally, a set of tools that the user simulator can use
Domains are:
`mock`, `airline`, `retail`, and `telecom`
All the information that an agent developer needs to build an agent for a domain can be accessed through the domain's API docs. See the domain documentation for more details.
- Clone the repository:
  ```bash
  git clone https://github.com/sierra-research/tau2-bench
  cd tau2-bench
  ```

- Create a new environment (optional):

  ```bash
  python -m venv .venv
  source .venv/bin/activate
  ```

- Install tau2:

  ```bash
  pip install -e .
  ```

This will enable you to run the `tau2` command.
Note: If you use pip install . (without -e), you'll need to set the TAU2_DATA_DIR environment variable to point to your data directory:
```bash
export TAU2_DATA_DIR=/path/to/your/tau2-bench/data
```

Check your data directory setup: after installation, you can verify that your data directory is correctly configured by running:

```bash
tau2 check-data
```

This command will check if the data directory exists and print instructions if it is missing.
To remove all the generated files and the virtual environment, run:
```bash
make clean
```

We use LiteLLM to manage LLM APIs, so you can use any LLM provider supported by LiteLLM.
To provide your API keys, copy .env.example as .env and edit it to include your API keys.
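For example, a minimal `.env` could look like the sketch below. The variable names shown (`OPENAI_API_KEY`, `ANTHROPIC_API_KEY`) are the standard provider keys read by LiteLLM; include only the providers you actually plan to use.

```bash
# .env — illustrative only; add keys for the providers you use
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
```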
To run a test evaluation on only 5 tasks with 1 trial per task, run:
```bash
tau2 run \
  --domain airline \
  --agent-llm gpt-4.1 \
  --user-llm gpt-4.1 \
  --num-trials 1 \
  --num-tasks 5
```

Results will be saved in `data/tau2/simulations/`.
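After the run finishes, you can browse the saved simulations interactively with the `tau2 view` tool described below:

```bash
tau2 view
```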
The tau2 command provides a unified interface for all functionality:
```bash
tau2 run \
  --domain <domain> \
  --agent-llm <llm_name> \
  --user-llm <llm_name> \
  --num-trials <trial_count> \
  --task-ids <task_ids> \
  --max-concurrency <concurrent_sims> \
  ...
```

```bash
tau2 view
```

This tool allows you to:
- Browse simulation files (in `data/tau2/simulations/`)
- View agent performance metrics
- View a particular simulation
- View task details
```bash
tau2 domain <domain>
```

Visit http://127.0.0.1:8004/redoc to see the domain policy and API documentation.
```bash
tau2 check-data
```

This command checks if your data directory is properly configured and all required files are present.
To submit your agent results to the τ²-bench leaderboard, you need to prepare a valid submission package that meets specific requirements.
Your trajectory runs must follow these constraints:
- Complete domain coverage: Include results for all three domains: `retail`, `airline`, and `telecom`.
- Consistent model configuration: All trajectory files must use:
  - The same agent LLM with identical arguments across all domains
  - The same user simulator LLM with identical arguments across all domains
- One result per domain: Each domain should appear exactly once in your submission.
- All tasks completed: Run evaluation on all tasks within each domain (don't use the `--task-ids` or `--num-tasks` filters).
First, run your agent evaluation on all domains with consistent settings:
```bash
# Example: Run complete evaluation for all domains
tau2 run --domain retail --agent-llm gpt-4.1 --user-llm gpt-4.1 --num-trials 4 --save-to my_model_retail
tau2 run --domain airline --agent-llm gpt-4.1 --user-llm gpt-4.1 --num-trials 4 --save-to my_model_airline
tau2 run --domain telecom --agent-llm gpt-4.1 --user-llm gpt-4.1 --num-trials 4 --save-to my_model_telecom
```

Important: Use identical `--agent-llm`, `--user-llm`, and their arguments across all runs.
Use the submission preparation tool to create your leaderboard submission:
```bash
tau2 submit prepare data/tau2/simulations/my_model_*.json --output ./my_submission
```

This command will:
- Verify all trajectory files are valid
- Check that submission requirements are met
- Compute performance metrics (Pass^k rates)
- Prompt for required metadata (model name, organization, contact email)
- Create a structured submission directory with:
  - `submission.json`: Metadata and metrics
  - `trajectories/`: Your trajectory files
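Based on the files listed above, the prepared package should look roughly like this (illustrative sketch, not an exhaustive listing):

```bash
# Inspect the generated package (illustrative layout based on the files above)
# my_submission/
# ├── submission.json    # metadata and computed Pass^k metrics
# └── trajectories/      # your trajectory files
ls -R ./my_submission
```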
Before submitting, validate your submission package:
```bash
tau2 submit validate ./my_submission
```

This will verify:
- All required files are present
- Trajectory files are valid
- Domain coverage is complete
- Model configurations are consistent
To skip verification when preparing the package, add `--no-verify`:

```bash
tau2 submit prepare data/tau2/simulations/my_model_*.json --output ./my_submission --no-verify
```

To verify trajectory files on their own:

```bash
tau2 submit verify-trajs data/tau2/simulations/my_model_*.json
```

Once your submission package is prepared and validated:
- Review the generated `submission.json` file
- Follow the submission guidelines in web/leaderboard/public/submissions/README.md to create a Pull Request
- Keep your `trajectories/` directory for reference
The leaderboard will display your model's Pass^k success rates (k=1,2,3,4) across all domains.
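For intuition, a common way to estimate Pass^k (following the convention from the original τ-bench paper; stated here as an assumption about the metric, not a specification of the leaderboard's exact computation) is, for each task with n trials and c successes, the probability that k trials drawn without replacement all succeed, averaged over tasks:

$$
\text{Pass}^k \;=\; \mathbb{E}_{\text{task}}\left[\frac{\binom{c}{k}}{\binom{n}{k}}\right]
$$

With `--num-trials 4` as in the example runs above, n = 4 and k ranges over 1 to 4.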
The `experiments/` directory contains experimental features and research code that extends beyond the core tau2 benchmark. This directory is designed for community contributions of innovative approaches, prototypes, and new features that are not part of the core evaluation framework.
- Purpose: Research code and experimental features
- Location: `src/experiments/`
- Usage: Each experimental component has its own README with documentation
- Status: Experimental code is provided as-is and may not be fully tested or supported
For more details, see the experiments README.
The `telecom` domain enables running ablation studies.
- Running an LLM in `no-user` mode. In this mode, the LLM is given all the tools and the information upfront. Just choose `llm_agent_solo` as the agent and `dummy_user` as the user.

  ```bash
  tau2 run \
    --domain telecom \
    --agent llm_agent_solo \
    --agent-llm gpt-4.1 \
    --user dummy_user \
    ...
  ```

- Running an LLM in `oracle-plan` mode. In this mode, the LLM is given an oracle plan ahead of time, alleviating the need for action planning. Just choose `llm_agent_gt` as the agent.

  ```bash
  tau2 run \
    --domain telecom \
    --agent llm_agent_gt \
    --agent-llm gpt-4.1 \
    --user-llm gpt-4.1 \
    ...
  ```

To test the impact of policy format, we provide an additional "workflow" policy for the telecom domain.
To run using this policy, use the `telecom-workflow` domain.

```bash
tau2 run \
  --domain telecom-workflow \
  --agent-llm gpt-4.1 \
  --user-llm gpt-4.1 \
  ...
```

For all the details, see the domains README.
- Code is located in `src/tau2/domains/`
- Data is located in `data/tau2/domains/`
- Each domain has its own configuration and task definitions
Run the following command to see the domain policy and API documentation:

```bash
tau2 env <domain>
```

Then visit http://127.0.0.1:8004/redoc
An interactive command-line interface for directly querying and testing domain environments. Features:
- Interactive query interface with domain-specific tools
- Support for multiple domains (airline, mock, etc.)
- Session management with history
To use:
```bash
make env-cli
```

Available commands:

- `:q` - quit the program
- `:d` - change domain
- `:n` - start new session (clears history)
Example usage:
```
$ make env-cli
Welcome to the Environment CLI!
Connected to airline domain.

Query (:n new session, :d change domain, :q quit)> What flights are available from SF to LA tomorrow?
Assistant: Let me check the flight availability for you...
[Flight details will appear here]
```

The Environment CLI is useful for:
- Testing domain tools and queries
- Debugging environment responses
- Exploring available domain functionality
- Quick domain interaction without starting the full server stack
To run the test suite, use the command:

```bash
make test
```

To configure the framework, see the config file.
LLM call caching is disabled by default.
To enable LLM call caching:

- Make sure Redis is running.
- Update the Redis config in `config.py` if necessary.
- Set `LLM_CACHE_ENABLED` to `True` in `config.py`.
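For example, assuming a local Redis on the default host and port, you can confirm it is reachable before flipping the flag (these are generic Redis commands, not part of the `tau2` CLI):

```bash
# Start a local Redis instance in the background (default host/port assumed)
redis-server --daemonize yes
# Should print "PONG" if Redis is reachable
redis-cli ping
```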
For local or remote agent evaluation, see our agent developer guide.
We welcome contributions to τ²-bench! Whether you're fixing bugs, adding new features, creating new domains, or contributing experimental research code, please see our Contributing Guide for detailed guidelines on:
- Opening issues before starting work
- Branch naming conventions and development workflow
- Code quality standards and testing requirements
- Pull request guidelines for clean, reviewable contributions
- Guidelines specific to domain and experimental contributions
For experimental features and research code, check out the `experiments/` directory.
```mermaid
sequenceDiagram
participant O as Orchestrator
participant A as Agent
participant U as UserSimulator
participant E as Environment
Note over O: Initialize(task)
rect rgb(100, 150, 150)
O->>A: get_init_state_info(message_history)
A->>O: agent_state_info
O->>U: get_init_state_info(message_history)
U->>O: user_state_info
O->>E: set_state(initialization_data, initialization_actions, message_history)
end
Note over O: Start simulation
loop Pass messages between Agent, User, and Environment
alt Agent/Env to User
rect rgb(200, 150, 150)
O->>U: generate_next_message(msg, user_state_info)
U-->>O: (user_msg, user_state_info)
end
Note over O: Check if user_msg is STOP
else User/Env to Agent
rect rgb(100, 200, 100)
O->>A: generate_next_message(msg, agent_state_info)
A-->>O: (assistant_msg, agent_state_info)
Note over O: Check if too many errors
end
else User/Agent to Environment
rect rgb(150, 150, 200)
O->>E: get_response(tool_call)
E-->>O: tool_message
end
end
Note over O: Check if max turns reached.
end
Note over O: Return simulation run
```
```bibtex
@misc{barres2025tau2,
title={$\tau^2$-Bench: Evaluating Conversational Agents in a Dual-Control Environment},
author={Victor Barres and Honghua Dong and Soham Ray and Xujie Si and Karthik Narasimhan},
year={2025},
eprint={2506.07982},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2506.07982},
}
```