An LLM-powered tree search engine for multi-turn conversation optimization.
DTS explores conversation strategies in parallel, simulates diverse user reactions, scores trajectories with multi-judge consensus, and prunes underperformers, finding optimal dialogue paths that single-shot LLM responses miss.
Real-time tree exploration with strategy scoring, conversation playback, and detailed evaluation breakdowns
- Why DTS?
- How It Works
- System Architecture
- Prerequisites & API Keys
- Installation
- Quick Start
- Configuration
- Deep Research Integration
- API Reference
- Frontend Visualizer
- Project Structure
- Token Usage & Cost Management
- Troubleshooting
- License
Standard LLMs generate responses one turn at a time, optimizing locally without considering long-term conversation outcomes. This leads to:
- Myopic responses that sound good but lead to dead ends
- Single-path thinking that misses better strategic approaches
- Fragile strategies that fail when users respond unexpectedly
DTS solves this by treating conversation as a tree search problem:
- Explore multiple strategies in parallel (not just one response)
- Simulate diverse user reactions (skeptical, enthusiastic, confused, etc.)
- Score complete trajectories against your goal
- Prune bad paths early to focus computation on promising directions
The result: dialogue strategies that are robust, goal-oriented, and tested against varied user behaviors.
DTS implements a parallel beam search with the following loop:
For each round:
1. Generate N diverse conversation strategies
2. For each strategy, simulate K user intent variants
3. Roll out multi-turn conversations for each branch
4. Score all trajectories with 3 independent judges
5. Prune branches below threshold (median vote)
6. Backpropagate scores up the tree
7. Repeat with surviving branches
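Conceptually, the loop reads as a parallel beam search. The Python sketch below illustrates the control flow only; the helper callables (`generate_strategies`, `simulate_users`, `rollout`, `judge`) are hypothetical stand-ins, not the real DTS components:

```python
from statistics import median
from typing import Any, Callable

def tree_search(
    goal: str,
    first_message: str,
    generate_strategies: Callable[..., list[Any]],  # hypothetical: returns N strategy branches
    simulate_users: Callable[..., list[Any]],       # hypothetical: returns K user-intent variants
    rollout: Callable[..., Any],                    # hypothetical: plays out a multi-turn conversation
    judge: Callable[..., float],                    # hypothetical: scores a trajectory 0-10
    rounds: int = 2,
    prune_threshold: float = 6.5,
) -> Any:
    """Sketch of the expand -> simulate -> score -> prune loop described above."""
    beam = generate_strategies(goal, first_message, n=6)             # step 1
    for _ in range(rounds):
        survivors = []
        for branch in beam:
            for intent in simulate_users(branch, k=3):               # step 2
                trajectory = rollout(branch, intent, turns=5)        # step 3
                votes = [judge(trajectory, goal) for _ in range(3)]  # step 4: three judges
                score = median(votes)                                # step 5: median vote
                if score >= prune_threshold:                         # step 5: prune below threshold
                    survivors.append((score, trajectory))
        # steps 6-7: surviving branches (best first) seed the next round
        beam = [t for _, t in sorted(survivors, key=lambda s: s[0], reverse=True)]
    return beam[0] if beam else None
```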
Unlike traditional single-path generation, DTS maintains multiple conversation branches simultaneously:
graph TD
subgraph Round 1
Root[User Message] --> S1[Strategy: Empathetic]
Root --> S2[Strategy: Direct]
Root --> S3[Strategy: Socratic]
end
subgraph Round 2
S1 --> S1I1[Intent: Cooperative]
S1 --> S1I2[Intent: Skeptical]
S2 --> S2I1[Intent: Cooperative]
S2 --> S2I2[Intent: Resistant]
end
subgraph Scoring
S1I1 --> J1((Judge 1))
S1I1 --> J2((Judge 2))
S1I1 --> J3((Judge 3))
J1 & J2 & J3 --> M{Median Vote}
end
M -->|Score ≥ 6.5| Keep[Keep Branch]
M -->|Score < 6.5| Prune[Prune Branch]
Branches are color-coded by score: green (passing), yellow (borderline), red (pruned)
Key parameters:
- `init_branches`: Number of initial strategies (default: 6)
- `turns_per_branch`: Conversation depth per branch (default: 5)
- `max_concurrency`: Parallel LLM calls (default: 16)
Most dialogue systems assume a single "happy path" user response. DTS can stress-test strategies against diverse user personas when enabled.
User Variability Mode:
- `user_variability=False` (default): Uses a fixed "healthily critical + engaged" persona for consistent, realistic testing
- `user_variability=True`: Generates diverse user intents for robustness testing across user types
When variability is enabled, possible user personas include:
| Emotional Tone | Cognitive Stance | Example Behavior |
|---|---|---|
| `engaged` | `accepting` | Cooperative, follows suggestions |
| `skeptical` | `questioning` | Asks for evidence, challenges claims |
| `confused` | `exploring` | Needs clarification, misunderstands |
| `resistant` | `challenging` | Pushes back, disagrees |
| `anxious` | `withdrawing` | Hesitant, wants to end conversation |
Each strategy can fork into K intent variants (configurable via user_intents_per_branch), creating branches that prove robustness across user types.
UserIntent structure:
UserIntent(
id="skeptical_questioner",
label="Skeptical Questioner",
description="Demands evidence before accepting claims",
emotional_tone="skeptical", # How user feels
cognitive_stance="questioning", # How user thinks
)

Each trajectory is evaluated by 3 independent LLM judges. Scores are aggregated via median voting (robust to outlier judges):
Judge 1: 7.2 ─┐
Judge 2: 6.8 ─┼──► Median: 7.2 ──► Pass (≥ 6.5)
Judge 3: 8.1 ─┘
Why 3 judges?
- Single judge = high variance, easily gamed
- Median of 3 = robust to one outlier
- Majority vote determines pass/fail (2 of 3 must pass)
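A minimal sketch of this aggregation (median score plus a 2-of-3 pass vote). The actual logic lives in `backend/core/dts/aggregator.py`; the function below is illustrative only:

```python
from statistics import median

def aggregate(judge_scores: list[float], threshold: float = 6.5) -> tuple[float, bool]:
    """Median of the judge scores plus a majority pass/fail vote."""
    agg = median(judge_scores)                            # robust to one outlier judge
    passing_votes = sum(score >= threshold for score in judge_scores)
    return agg, passing_votes >= 2                        # 2 of 3 judges must pass

# Example from the diagram above:
score, passed = aggregate([7.2, 6.8, 8.1])                # -> (7.2, True)
```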
Scoring criteria (each 0-1, summed to 0-10):
- Goal achievement
- User need addressed
- Forward progress
- Clarity & coherence
- Appropriate tone
- Information accuracy
- Handling objections
- Building rapport
- Conversation flow
- Strategic effectiveness
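As a toy illustration of the 0-10 scale: each criterion contributes at most one point and the points are summed. The key names below simply mirror the list above; the exact structure returned by the judges is an assumption:

```python
# Hypothetical judge output: ten criteria, each scored 0-1, summed to 0-10.
criteria = {
    "goal_achievement": 0.9,
    "user_need_addressed": 0.8,
    "forward_progress": 0.7,
    "clarity_coherence": 0.9,
    "appropriate_tone": 0.8,
    "information_accuracy": 0.6,
    "handling_objections": 0.5,
    "building_rapport": 0.7,
    "conversation_flow": 0.7,
    "strategic_effectiveness": 0.6,
}
total = round(sum(criteria.values()), 1)  # -> 7.2 on the 0-10 scale
```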
*Screenshots: a high-scoring branch (9.2/10) and a pruned branch (4.1/10)*
Left: A successful trajectory with detailed strengths. Right: A pruned branch showing weaknesses and why it failed.
DTS supports two evaluation modes:
| Mode | How It Works | Best For |
|---|---|---|
| Comparative | Sibling branches force-ranked against each other | Sharp discrimination, finding the single best path |
| Absolute | Each branch scored independently (0-10) | Early pruning, filtering obviously bad paths |
Comparative mode (default):
Input: [Strategy A, Strategy B, Strategy C] (siblings)
Output: A=7.5, B=6.0, C=4.5 (forced ranking with 1.5-point gaps)
Absolute mode:
Input: Strategy A (evaluated alone)
Output: 3 judges β [7.2, 6.8, 8.1] β Median: 7.2
Use scoring_mode="comparative" when you need the best single answer.
Use scoring_mode="absolute" when filtering many branches quickly.
sequenceDiagram
participant User
participant FE as Frontend (HTML/JS)
participant API as FastAPI WebSocket
participant ENG as DTS Engine
participant LLM as OpenRouter/OpenAI
participant RES as Firecrawl + Tavily
User->>FE: Configure & Start Search
FE->>API: WebSocket Connect
API->>ENG: Initialize DTSEngine
opt Deep Research Enabled
ENG->>RES: Research Query
RES-->>ENG: Domain Context
end
loop For Each Round
ENG->>LLM: Generate Strategies
LLM-->>ENG: N Strategies
loop For Each Branch
ENG->>LLM: Generate User Intents
ENG->>LLM: Simulate Conversation
ENG->>LLM: Judge Trajectory (3x)
end
ENG->>API: Emit Events (node_added, scored, pruned)
API-->>FE: Stream Updates
FE->>User: Update Visualization
end
ENG->>API: Complete with Best Path
FE->>User: Show Results
| Component | Location | Purpose |
|---|---|---|
| `DTSEngine` | `backend/core/dts/engine.py` | Main orchestrator, runs the expand→score→prune loop |
| `StrategyGenerator` | `backend/core/dts/components/generator.py` | Creates strategies and user intents |
| `ConversationSimulator` | `backend/core/dts/components/simulator.py` | Runs multi-turn dialogue rollouts |
| `TrajectoryEvaluator` | `backend/core/dts/components/evaluator.py` | Multi-judge scoring with median aggregation |
| `DeepResearcher` | `backend/core/dts/components/researcher.py` | GPT-Researcher integration for context |
| `DialogueTree` | `backend/core/dts/tree.py` | Tree data structure with backpropagation |
| LLM Client | `backend/llm/client.py` | Provider-agnostic OpenAI-compatible wrapper |
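To make the `DialogueTree`'s backpropagation role concrete, here is a minimal, hypothetical sketch in which each node tracks the best score seen in its subtree. The real structure in `backend/core/dts/tree.py` is richer; this only illustrates the idea:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    score: float | None = None        # score assigned by the judges, if evaluated
    best_subtree_score: float = 0.0   # best score seen at or below this node
    parent: "Node | None" = None
    children: list["Node"] = field(default_factory=list)

def backpropagate(leaf: Node, score: float) -> None:
    """Push a new leaf score up toward the root so ancestors know their best descendant."""
    leaf.score = score
    current: Node | None = leaf
    while current is not None:
        current.best_subtree_score = max(current.best_subtree_score, score)
        current = current.parent
```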
| Service | Environment Variable | Required | Purpose |
|---|---|---|---|
| LLM Provider | `OPENROUTER_API_KEY` | Yes | Strategy generation, simulation, and judging |
| Web Scraping | `FIRECRAWL_API_KEY` | For Deep Research | Scrapes web pages for research context |
| Web Search | `TAVILY_API_KEY` | For Deep Research | Searches the web for relevant sources |
- **OpenRouter** (recommended): openrouter.ai/keys
  - Works with 100+ models (GPT-4, Claude, Gemini, open-source)
  - Pay-per-token, no subscriptions
  - Set `OPENAI_BASE_URL=https://openrouter.ai/api/v1`
- **Firecrawl**: firecrawl.dev
  - Required for `deep_research=True`
  - Handles JavaScript-rendered pages, anti-bot bypass
- **Tavily**: tavily.com
  - Required for `deep_research=True`
  - AI-optimized web search API
Note: Deep Research features require both Firecrawl and Tavily keys. Without them, set `deep_research=False` in your configuration.
| Variable | Default | Description |
|---|---|---|
| `OPENAI_BASE_URL` | `https://openrouter.ai/api/v1` | LLM API endpoint |
| `LLM_NAME` | `minimax/minimax-m2.1` | Default model for all phases |
| `FAST_LLM` | `openrouter:minimax/minimax-m2.1` | Fast model for research |
| `SMART_LLM` | `openrouter:minimax/minimax-m2.1` | Smart model for complex tasks |
| `STRATEGIC_LLM` | `openrouter:minimax/minimax-m2.1` | Strategic reasoning model |
| `LLM_TIMEOUT` | `120` | Request timeout in seconds |
| `LLM_MAX_RETRIES` | `2` | Retry attempts on failure |
| `MAX_CONCURRENCY` | `16` | Parallel LLM call limit |
Note: The default model `minimax/minimax-m2.1` is chosen for its excellent price/performance ratio. You can use any OpenRouter-compatible model for any task by overriding the model parameters in your configuration.
Requires Python 3.11+
# Clone the repository
git clone https://github.com/MVPandey/DTS.git
cd DTS
# Install uv (fast Python package manager)
curl -LsSf https://astral.sh/uv/install.sh | sh
# Create virtual environment and install dependencies
uv venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
uv pip install -e .

Create a `.env` file in the project root (see `.env.example`):
# Required - API Keys
OPENROUTER_API_KEY=sk-or-v1-your-openrouter-key
TAVILY_API_KEY=tvly-your-tavily-key
FIRECRAWL_API_KEY=fc-your-firecrawl-key
# LLM models (minimax-m2.1 recommended for price/performance)
FAST_LLM=openrouter:minimax/minimax-m2.1
SMART_LLM=openrouter:minimax/minimax-m2.1
STRATEGIC_LLM=openrouter:minimax/minimax-m2.1
SMART_TOKEN_LIMIT=32000
# Deep research parameters
DEEP_RESEARCH_BREADTH=3
DEEP_RESEARCH_DEPTH=2
DEEP_RESEARCH_CONCURRENCY=4
# Report comprehensiveness (higher = more detailed)
TOTAL_WORDS=12000
MAX_SUBTOPICS=8
MAX_ITERATIONS=5
MAX_SEARCH_RESULTS=10
REPORT_FORMAT=markdown

Model flexibility: You can use any model available on OpenRouter for any task. The default `minimax/minimax-m2.1` offers an excellent balance of cost and capability, but feel free to swap in `anthropic/claude-3-opus`, `openai/gpt-4o`, or any other model for specific phases.
The easiest way to start the server:
Unix/macOS/Linux:
# Start with Docker
./scripts/start_server.sh
# Start in development mode (hot reload)
./scripts/start_server.sh --dev
# Start without Docker (local Python)
./scripts/start_server.sh --local
# Stop the server
./scripts/start_server.sh --down

Windows:
REM Start with Docker
scripts\start_server.bat
REM Start in development mode (hot reload)
scripts\start_server.bat --dev
REM Start without Docker (local Python)
scripts\start_server.bat --local
REM Stop the server
scripts\start_server.bat --down

# Start the server (production)
docker-compose up -d dts-server
# Start in development mode with hot reload
docker-compose --profile dev up dts-server-dev
# View logs
docker-compose logs -f
# Stop and remove containers
docker-compose down
# Rebuild after code changes
docker-compose up -d --build dts-server

# Activate virtual environment
source .venv/bin/activate # Windows: .venv\Scripts\activate
# Set PYTHONPATH
export PYTHONPATH=$(pwd) # Windows: set PYTHONPATH=%cd%
# Start server
uvicorn backend.api.server:app --host localhost --port 8000 --reload --log-level info

The project includes VSCode launch configurations for debugging:
- Open the project in VSCode
- Go to Run and Debug (Ctrl+Shift+D / Cmd+Shift+D)
- Select a configuration from the dropdown:
| Configuration | Description |
|---|---|
| Debug DTS Server | Start the API server with debugger attached |
| Debug Current Python File | Debug the currently open file |
| Attach to Remote Debugpy | Attach to a running debugpy server (port 5678) |
- Press F5 or click the green play button
VSCode Launch Configuration (.vscode/launch.json):
{
"name": "Debug DTS Server",
"type": "debugpy",
"request": "launch",
"module": "uvicorn",
"args": [
"backend.api.server:app",
"--host", "localhost",
"--port", "8000",
"--reload",
"--log-level", "info"
],
"env": { "PYTHONPATH": "${workspaceFolder}" },
"envFile": "${workspaceFolder}/.env"
}

For programmatic use without the web interface:
import asyncio
from backend.core.dts import DTSConfig, DTSEngine
from backend.llm.client import LLM
from backend.utils.config import config
async def main():
# Initialize LLM client (uses minimax-m2.1 by default for best price/performance)
llm = LLM(
api_key=config.openrouter_api_key,
base_url=config.openai_base_url,
model="minimax/minimax-m2.1", # Or any OpenRouter model
)
# Configure the search
dts_config = DTSConfig(
goal="Negotiate a 15% discount on enterprise software",
first_message="Hi, I'd like to discuss our renewal pricing.",
init_branches=6, # 6 initial strategies
turns_per_branch=5, # 5-turn conversations
user_intents_per_branch=3, # Fork into 3 user persona variants
user_variability=False, # Use fixed "healthily critical" persona (default)
scoring_mode="comparative", # Force-rank siblings
prune_threshold=6.5, # Minimum score to survive
deep_research=True, # Enable research context
)
# Run the search
engine = DTSEngine(llm=llm, config=dts_config)
result = await engine.run(rounds=2)
# Output results
print(f"Best Score: {result.best_score:.1f}/10")
print(f"Branches Explored: {len(result.all_nodes)}")
print(f"Branches Pruned: {result.pruned_count}")
# Save full results
result.save_json("output.json")
if __name__ == "__main__":
asyncio.run(main())

Or run the included example:
python main.py

Once the server is running:
| Resource | URL |
|---|---|
| API | http://localhost:8000 |
| API Docs (Swagger) | http://localhost:8000/docs |
| API Docs (ReDoc) | http://localhost:8000/redoc |
| Frontend | Open frontend/index.html in your browser |
| Parameter | Type | Default | Description |
|---|---|---|---|
| `goal` | `str` | required | What you want the conversation to achieve |
| `first_message` | `str` | required | Opening user message |
| `init_branches` | `int` | `6` | Number of initial strategies to generate |
| `turns_per_branch` | `int` | `5` | Conversation depth (assistant + user turns) |
| `user_intents_per_branch` | `int` | `3` | User persona variants per strategy |
| `user_variability` | `bool` | `False` | Generate diverse user intents. When `False`, uses the fixed "healthily critical + engaged" persona |
| `scoring_mode` | `str` | `"comparative"` | `"comparative"` or `"absolute"` |
| `prune_threshold` | `float` | `6.5` | Minimum score to survive (0-10 scale) |
| `keep_top_k` | `int \| None` | `None` | Hard cap on survivors per round |
| `min_survivors` | `int` | `1` | Minimum branches to keep (floor) |
| `deep_research` | `bool` | `False` | Enable GPT-Researcher integration |
| `max_concurrency` | `int` | `16` | Parallel LLM call limit |
| `temperature` | `float` | `0.7` | Generation temperature |
| `judge_temperature` | `float` | `0.3` | Judge temperature (lower = more consistent) |
| `reasoning_enabled` | `bool` | `False` | Enable reasoning tokens for LLM calls (increases cost but may improve quality) |
| `provider` | `str \| None` | `None` | Provider preference for OpenRouter (e.g., "Fireworks") |
DTS integrates GPT-Researcher to gather domain context before generating strategies.
graph LR
A[Goal + First Message] --> B[Query Distillation]
B --> C[Web Search via Tavily]
C --> D[Page Scraping via Firecrawl]
D --> E[Research Report]
E --> F[Strategy Generation]
E --> G[Judge Evaluation]
- Query Distillation: LLM converts goal into focused research query
- Web Search: Tavily finds relevant sources
- Scraping: Firecrawl extracts content (handles JS, anti-bot)
- Report: GPT-Researcher synthesizes findings
- Injection: Report fed to strategy generator and judges
DTSConfig(
goal="Explain quantum computing to a 10-year-old",
first_message="What's quantum computing?",
deep_research=True, # Enable research
)

Research results are cached by SHA256(goal + first_message) in `.cache/research/`. Subsequent runs with the same inputs skip the research phase.
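For illustration, the cache key could be derived roughly like this (a sketch of the idea described above, not the exact implementation; the file name and extension are assumptions):

```python
import hashlib
from pathlib import Path

def research_cache_path(goal: str, first_message: str) -> Path:
    """Cache file keyed by SHA256(goal + first_message)."""
    key = hashlib.sha256((goal + first_message).encode("utf-8")).hexdigest()
    return Path(".cache/research") / f"{key}.json"  # assumed naming scheme
```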
| Service | Purpose | Get Key |
|---|---|---|
| Firecrawl | Web page scraping | firecrawl.dev |
| Tavily | Web search | tavily.com |
Cost Note: Deep research adds external API costs beyond LLM tokens. Monitor usage during development.
URL: ws://localhost:8000/ws
{
"type": "start_search",
"config": {
"goal": "Your conversation goal",
"first_message": "Opening user message",
"init_branches": 6,
"turns_per_branch": 5,
"user_intents_per_branch": 3,
"scoring_mode": "comparative",
"prune_threshold": 6.5,
"rounds": 2,
"deep_research": false
}
}

The server emits real-time events as the search progresses:
| Event | Description | Data |
|---|---|---|
| `search_started` | Search initialized | `{goal, config}` |
| `phase` | Lifecycle update | `{phase, message}` |
| `strategy_generated` | New strategy created | `{tagline, description}` |
| `intent_generated` | User intent created | `{label, emotional_tone}` |
| `research_log` | Deep research progress | `{message}` |
| `round_started` | Round begins | `{round, total_rounds}` |
| `node_added` | Branch created | `{id, strategy, intent}` |
| `node_updated` | Branch scored | `{id, score, passed}` |
| `nodes_pruned` | Branches removed | `{ids, reasons}` |
| `token_update` | Token usage snapshot | `{totals}` |
| `complete` | Search finished | `{best_node, all_nodes}` |
| `error` | Error occurred | `{message}` |
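A minimal Python client sketch for this protocol, using the third-party `websockets` package. The payload mirrors the `start_search` message above; the assumption that every event carries a top-level `type` field follows the same convention, and error handling is omitted:

```python
import asyncio
import json

import websockets  # pip install websockets

async def run_search() -> None:
    async with websockets.connect("ws://localhost:8000/ws") as ws:
        await ws.send(json.dumps({
            "type": "start_search",
            "config": {
                "goal": "Negotiate a 15% discount on enterprise software",
                "first_message": "Hi, I'd like to discuss our renewal pricing.",
                "init_branches": 6,
                "turns_per_branch": 5,
                "rounds": 2,
                "deep_research": False,
            },
        }))
        async for raw in ws:                  # stream events until the search finishes
            event = json.loads(raw)
            print(event.get("type"), event)
            if event.get("type") in ("complete", "error"):
                break

asyncio.run(run_search())
```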
FastAPI auto-generates OpenAPI docs at http://localhost:8000/docs.
The included frontend (frontend/index.html) provides real-time visualization:
Configuration panel with basic parameters, deep research toggle, and user variability settings
- Configuration Panel: Set goal, branches, turns, rounds, and model settings
- Deep Research Toggle: Enable/disable GPT-Researcher integration
- Reasoning Mode: Auto-detect or manually set reasoning effort
- User Variability Toggle: Switch between fixed "healthily critical" persona (default) or diverse user personas for robustness testing
- Live Progress: Watch strategies generate and branches expand
- Branch Browser: Explore all trajectories with full transcripts
- Score Details: See individual judge scores and critiques
- Token Tracking: Monitor costs by phase and model
- Export: Download results as JSON
- Start the API server: `uvicorn backend.api.server:app --port 8000`
- Open `frontend/index.html` in a browser
- Configure your search and click "Start Search"
DTS/
├── backend/
│   ├── api/
│   │   ├── server.py              # FastAPI WebSocket server
│   │   └── schemas.py             # Pydantic request/response models
│   ├── core/
│   │   ├── dts/
│   │   │   ├── engine.py          # Main DTSEngine orchestrator
│   │   │   ├── config.py          # DTSConfig dataclass
│   │   │   ├── types.py           # Core data models
│   │   │   ├── tree.py            # DialogueTree structure
│   │   │   ├── aggregator.py      # Median vote aggregation
│   │   │   ├── retry.py           # Shared retry logic
│   │   │   └── components/
│   │   │       ├── generator.py   # Strategy & intent generation
│   │   │       ├── simulator.py   # Conversation rollouts
│   │   │       ├── evaluator.py   # Multi-judge scoring
│   │   │       └── researcher.py  # GPT-Researcher integration
│   │   └── prompts.py             # All prompt templates
│   ├── llm/
│   │   ├── client.py              # OpenAI-compatible LLM client
│   │   ├── types.py               # Message, Completion, Usage
│   │   ├── errors.py              # Custom exception types
│   │   └── tools.py               # Tool calling support
│   ├── services/
│   │   └── search_service.py      # API service layer
│   └── utils/
│       └── config.py              # Pydantic settings from .env
├── frontend/
│   ├── index.html                 # Single-page visualizer
│   └── app.js                     # WebSocket client & UI logic
├── scripts/
│   ├── start_server.sh            # Unix/macOS start script
│   └── start_server.bat           # Windows start script
├── gpt-researcher/                # GPT-Researcher submodule
├── .vscode/
│   └── launch.json                # VSCode debug configurations
├── Dockerfile                     # Container image definition
├── docker-compose.yml             # Multi-container orchestration
├── main.py                        # Example script
├── pyproject.toml                 # Project metadata & dependencies
├── CLAUDE.md                      # Developer instructions
└── README.md                      # This file
# Strategy for conversation approach
Strategy(tagline="Empathetic Listener", description="Validate feelings first...")
# User persona for intent forking
UserIntent(
id="skeptic",
label="Skeptical Questioner",
emotional_tone="skeptical",
cognitive_stance="questioning",
)
# Tree node with conversation state
DialogueNode(
id="uuid",
strategy=Strategy(...),
user_intent=UserIntent(...),
messages=[Message(role="user", content="...")],
stats=NodeStats(aggregated_score=7.2),
)
# Final result
DTSRunResult(
best_node_id="uuid",
best_score=8.1,
best_messages=[...],
all_nodes=[...],
token_usage={...},
)

DTS is token-intensive due to parallel exploration. A typical run involves:
Cost Formula ≈ Branches × Intents × Turns × (Generation + 3×Judging)
Example: 6 branches × 3 intents × 5 turns × 4 calls = 360 LLM calls
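The same back-of-the-envelope estimate as code (a per-round upper bound that ignores pruning):

```python
def estimate_llm_calls(branches: int = 6, intents: int = 3, turns: int = 5, judges: int = 3) -> int:
    """Rough call count: one generation call per turn plus the judge calls, per the formula above."""
    return branches * intents * turns * (1 + judges)

estimate_llm_calls()  # -> 360, matching the example above
```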
| Phase | % of Tokens | Purpose |
|---|---|---|
| Strategy Generation | ~10% | Creating initial approaches |
| Intent Generation | ~5% | Generating user personas |
| User Simulation | ~30% | Simulating user responses |
| Assistant Simulation | ~25% | Generating assistant replies |
| Judging | ~30% | 3 judges per trajectory |
- Raise `prune_threshold`: Aggressively cull bad branches (6.5 → 7.0)
- Set `keep_top_k`: Hard cap on survivors (e.g., `keep_top_k=3`)
- Lower `turns_per_branch`: Shorter conversations (5 → 3)
- Disable forking: Set `user_intents_per_branch=1`
- Fewer rounds: Start with 1 round, add more if needed
The engine tracks tokens per phase and model:
result = await engine.run(rounds=2)
print(result.token_usage)
# {
# "total_input_tokens": 45000,
# "total_output_tokens": 12000,
# "total_cost_usd": 0.42,
# "by_phase": {...},
# "by_model": {...},
# }

| Error | Cause | Solution |
|---|---|---|
| `AuthenticationError` | Invalid API key | Check `OPENROUTER_API_KEY` in `.env` |
| `RateLimitError` | Too many requests | Lower `max_concurrency`, add delays |
| `ContextLengthError` | Conversation too long | Reduce `turns_per_branch` |
| `ValueError: FIRECRAWL_API_KEY required` | Missing research key | Add key or set `deep_research=False` |
| `JSONParseError` | LLM returned invalid JSON | Retry usually fixes; check model quality |
| `ServerError` (5xx) | Provider issues | Automatic retry with backoff |
Enable verbose logging:
DEBUG=true
LOGGING_LEVEL=DEBUG

If you don't have Firecrawl/Tavily keys:
DTSConfig(
goal="...",
first_message="...",
deep_research=False, # Disable research
)

Apache License 2.0. See LICENSE.
Contributions welcome! Please read CONTRIBUTING.md before submitting PRs.
- GPT-Researcher for deep research capabilities
- OpenRouter for multi-model API access
- Firecrawl for web scraping
- Tavily for AI-optimized search