hengzzzhou/STS

Select-then-Solve (STS): Paradigm Routing as Inference-Time Optimization for LLM Agents

Official code for the paper "Select-then-Solve: Paradigm Routing as Inference-Time Optimization for LLM Agents".

Overview

We compare six inference-time reasoning paradigms (Direct, CoT, ReAct, Plan-Execute, Reflection, ReCode) across four frontier LLMs and ten benchmarks (~18k runs), and propose a lightweight embedding-based router that selects the best paradigm per task.

Figure 1: The Select-then-Solve pipeline. A lightweight router selects the best reasoning paradigm before the LLM answers.
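
The select-then-solve flow can be illustrated with a toy nearest-centroid router. This is a minimal sketch, not the repository's router (that code lives in src/predictor/ and scripts/train_router_v2.py): the `embed` function is a bag-of-words stand-in for a real sentence embedding, and `NearestCentroidRouter`, `select_then_solve`, and the training examples are hypothetical names and data chosen for illustration.

```python
import math
from collections import Counter, defaultdict


def embed(text):
    """Toy stand-in for a sentence embedding: L2-normalised token counts."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {tok: c / norm for tok, c in counts.items()} if norm else {}


def cosine(a, b):
    """Dot product of two sparse unit vectors, which equals their cosine."""
    return sum(w * b.get(tok, 0.0) for tok, w in a.items())


class NearestCentroidRouter:
    """Routes a task to the paradigm whose training centroid is closest."""

    def fit(self, tasks, best_paradigms):
        sums = defaultdict(Counter)
        for text, label in zip(tasks, best_paradigms):
            for tok, w in embed(text).items():
                sums[label][tok] += w
        self.centroids = {}
        for label, vec in sums.items():
            norm = math.sqrt(sum(w * w for w in vec.values()))
            self.centroids[label] = {t: w / norm for t, w in vec.items()}
        return self

    def route(self, task, default="direct"):
        query = embed(task)
        scores = {lab: cosine(query, c) for lab, c in self.centroids.items()}
        best = max(scores, key=scores.get)
        # Fall back to the default paradigm when nothing overlaps the query.
        return best if scores[best] > 0 else default


def select_then_solve(task, router, strategies):
    """Select a paradigm first, then hand the task to that strategy."""
    paradigm = router.route(task)
    return paradigm, strategies[paradigm](task)


# Hypothetical training pairs: (task text, paradigm that worked best).
router = NearestCentroidRouter().fit(
    [
        "write a python function that reverses a string",
        "find the current population of the capital city",
        "prove the inequality step by step",
    ],
    ["direct", "react", "cot"],
)
# Stub strategies; a real one would call the LLM under that paradigm.
strategies = {p: (lambda task, p=p: f"[{p}] {task}") for p in ("direct", "react", "cot")}
```

Calling `select_then_solve("look up the population of Tokyo", router, strategies)` routes the task to the `react` stub here, because the query shares the most tokens with the ReAct training example.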

Key findings:

  • Reasoning structure is sharply task-dependent: ReAct gains 44pp on GAIA, whereas CoT loses 15pp on HumanEval
  • No single paradigm dominates; oracle per-task selection beats the best fixed paradigm by 17.1pp
  • Our embedding-based router improves average accuracy from 47.6% to 53.1%, recovering up to 37% of the oracle gap
  • Zero-shot self-routing works only for the strongest model (GPT-5: 67.1%); for the other models it trails the embedding router, and on Qwen3-Max it falls below Direct (42.4% vs. 49.8%)
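
To make the "best fixed paradigm" and "oracle" terms concrete, here is a minimal sketch that computes both from a per-task success table. The `results` dictionary is made-up toy data and `best_single_and_oracle` is a hypothetical helper, not a function from this repository.

```python
def best_single_and_oracle(results):
    """results maps task_id -> {paradigm: solved?}.

    Returns (best fixed paradigm, its accuracy, oracle accuracy), where the
    oracle counts a task as solved if any paradigm solves it.
    """
    paradigms = sorted({p for row in results.values() for p in row})
    n = len(results)
    accuracy = {
        p: sum(row.get(p, False) for row in results.values()) / n
        for p in paradigms
    }
    best = max(accuracy, key=accuracy.get)
    oracle = sum(any(row.values()) for row in results.values()) / n
    return best, accuracy[best], oracle


# Toy outcomes for four tasks and three paradigms (illustrative only).
results = {
    "t1": {"direct": True, "cot": False, "react": False},
    "t2": {"direct": False, "cot": True, "react": False},
    "t3": {"direct": True, "cot": False, "react": True},
    "t4": {"direct": False, "cot": False, "react": False},
}
best, best_acc, oracle_acc = best_single_and_oracle(results)
```

In this toy table the best fixed paradigm solves half the tasks while the oracle solves three of four; that difference is the headroom a router can try to recover.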

Project Structure

STS/
├── src/
│   ├── agent/           # BaseAgent + 6 strategy implementations
│   │   └── strategies/  # direct.py, cot.py, react.py, plan_execute.py, reflection.py, recode.py
│   ├── eval/            # Evaluation: metrics, code_eval, datasets
│   ├── predictor/       # Router: features, models, labeler, task_loader
│   └── tools/           # Web search + code execution tools
├── scripts/
│   ├── run_experiment.py      # Main experiment runner
│   ├── analyze.py             # Result aggregation
│   ├── train_predictor.py     # Train handcrafted-feature router
│   ├── train_router_v2.py     # Train embedding-based router + self-routing
│   └── evaluate_predictor.py  # Evaluate all routing methods
├── configs/
│   ├── paper_colm.yaml        # Config used for paper experiments
│   └── default_template.yaml  # Template config (add your API keys)
├── results/
│   └── aggregated/            # Pre-computed summary and detailed CSVs
└── data/                      # Benchmark data (from MiroThinker)

Quick Start

1. Install dependencies

pip install -r requirements.txt

2. Configure API keys

cp configs/default_template.yaml configs/default.yaml
# Edit configs/default.yaml and add your API keys

3. Run experiments

# Run a single model + paradigm + dataset
PYTHONPATH=. python scripts/run_experiment.py --model gpt-5 --paradigm react --dataset gaia

# Run all experiments from paper config
PYTHONPATH=. python scripts/run_experiment.py --config configs/paper_colm.yaml --parallel --workers 10
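
If you prefer to drive a custom subset of runs yourself rather than use the paper config, the single-run command above can be expanded over a grid. This is a sketch under assumptions: only `gpt-5`, `react`, and `gaia` appear in this README as accepted values, and the other paradigm identifiers are guessed from the file names under src/agent/strategies/, so check them against the runner before use.

```python
import os
import subprocess
from itertools import product

MODELS = ["gpt-5"]    # the only model identifier shown in this README
DATASETS = ["gaia"]   # the only dataset identifier shown in this README
# Assumed identifiers, inferred from the strategy file names.
PARADIGMS = ["direct", "cot", "react", "plan_execute", "reflection", "recode"]


def build_commands(models, paradigms, datasets):
    """Return one run_experiment.py command per (model, paradigm, dataset)."""
    return [
        ["python", "scripts/run_experiment.py",
         "--model", m, "--paradigm", p, "--dataset", d]
        for m, p, d in product(models, paradigms, datasets)
    ]


def run_all(commands):
    """Run the commands sequentially with PYTHONPATH=. as in the README."""
    env = {**os.environ, "PYTHONPATH": "."}
    for cmd in commands:
        subprocess.run(cmd, env=env, check=True)


commands = build_commands(MODELS, PARADIGMS, DATASETS)
if __name__ == "__main__":
    for cmd in commands:
        print(" ".join(cmd))  # dry run: print instead of executing
```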

4. Aggregate results

PYTHONPATH=. python scripts/analyze.py --results-dir results

5. Train and evaluate router

# Train handcrafted-feature predictors
PYTHONPATH=. python scripts/train_predictor.py --config configs/paper_colm.yaml

# Train embedding-based router + run self-routing
PYTHONPATH=. python scripts/train_router_v2.py

# Evaluate all methods
PYTHONPATH=. python scripts/evaluate_predictor.py --config configs/paper_colm.yaml

Paradigms

| Paradigm | Control | Tools | Description |
|---|---|---|---|
| Direct | None | No | Free-form answer, no scaffold |
| CoT | Instructed | No | Step-by-step reasoning |
| ReAct | Orchestrated | Yes | Thought-action loop with tools |
| Plan-Execute | Orchestrated | Yes | Plan first, then execute |
| Reflection | Orchestrated | Yes | Answer, critique, revise |
| ReCode | Substrate | Yes | Recursive code generation |
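
As an illustration of what an orchestrated paradigm looks like in code, the sketch below implements the "answer, critique, revise" loop of Reflection around a generic `llm` callable. It is not the implementation in src/agent/strategies/reflection.py: the prompts, the `max_rounds` parameter, and the "OK" stop signal are assumptions, and tool use is omitted for brevity.

```python
from typing import Callable


def reflection(task: str, llm: Callable[[str], str], max_rounds: int = 2) -> str:
    """Answer, then alternate critique and revision until the critic is satisfied."""
    answer = llm(f"Answer the task.\nTask: {task}")
    for _ in range(max_rounds):
        critique = llm(
            "Critique this answer. Reply OK if it is correct.\n"
            f"Task: {task}\nAnswer: {answer}"
        )
        if critique.strip().upper() == "OK":
            break  # the critic accepted the current answer
        answer = llm(
            "Revise the answer using the critique.\n"
            f"Task: {task}\nAnswer: {answer}\nCritique: {critique}"
        )
    return answer
```

Any function that maps a prompt string to a completion string can be passed as `llm`, which makes the loop easy to exercise with a scripted fake model.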

Results

No Single Paradigm Wins

Figure 2: Direct vs. best paradigm vs. oracle on each dataset (GPT-5). The best paradigm differs for every task type.

Router Comparison

| Method | GPT-5 | Gemini | Qwen3-Max | Qwen3-30B | Avg |
|---|---|---|---|---|---|
| Direct | 60.3 | 55.5 | 49.8 | 24.9 | 47.6 |
| Best-single | 62.4 | 55.5 | 50.7 | 32.8 | 50.3 |
| Embedding Router | 64.2 | 61.0 | 54.6 | 32.8 | 53.1 |
| Self-route | 67.1 | 56.8 | 42.4 | 27.5 | 48.4 |
| Oracle | 72.9 | 73.4 | 72.5 | 56.8 | 68.9 |

Figure 3: The embedding router matches or outperforms Direct and Best-single on every model (it ties Best-single on Qwen3-30B at 32.8).
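
One way to read the table is as the fraction of the baseline-to-oracle gap that the router closes. The sketch below computes that fraction from the table values with Direct as the baseline. The choice of baseline and the `gap_recovered` helper are assumptions made for illustration; the paper may define the oracle gap differently, so these numbers need not reproduce its headline figure.

```python
# Accuracy (%) copied from the Router Comparison table.
TABLE = {
    "GPT-5":     {"direct": 60.3, "router": 64.2, "oracle": 72.9},
    "Gemini":    {"direct": 55.5, "router": 61.0, "oracle": 73.4},
    "Qwen3-Max": {"direct": 49.8, "router": 54.6, "oracle": 72.5},
    "Qwen3-30B": {"direct": 24.9, "router": 32.8, "oracle": 56.8},
    "Avg":       {"direct": 47.6, "router": 53.1, "oracle": 68.9},
}


def gap_recovered(router, baseline, oracle):
    """Fraction of the (oracle - baseline) gap closed by the router."""
    return (router - baseline) / (oracle - baseline)


# Assumption: the gap is measured against Direct rather than Best-single.
recovered = {
    model: gap_recovered(row["router"], row["direct"], row["oracle"])
    for model, row in TABLE.items()
}

if __name__ == "__main__":
    for model, frac in recovered.items():
        print(f"{model:10s} {frac:.1%}")
```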

GPT-5 Success Rate Heatmap

Figure 4: Success rates across all paradigms and datasets. No single row dominates all columns.

Benchmarks

The evaluation data follows a deliberate protocol:

  • HLE and GAIA: we adopt MiroThinker's standardized text-only splits to match its evaluation setting
  • AIME, SEAL, τ-bench: used as-is from the original benchmark releases (full sets, no subsampling)
  • HumanEval, MATH500, HotpotQA, NQ, MMLU: 100 examples sampled from each with seed=42 for reproducibility; subsampling keeps token cost manageable across 4 models × 6 paradigms × ~761 tasks ≈ 18k runs

| Dataset | Domain | # Used | Protocol |
|---|---|---|---|
| HumanEval | Code generation | 100 | Sampled (seed=42) |
| MATH500 | Math reasoning | 100 | Sampled (seed=42) |
| AIME | Competition math | 60 | Full (2024 + 2025) |
| HotpotQA | Multi-hop QA | 100 | Sampled (seed=42) |
| Natural Questions | Factoid QA | 100 | Sampled (seed=42) |
| MMLU | Multitask knowledge | 100 | Sampled (seed=42) |
| HLE | Hard language exam | 500 | MiroThinker standardized |
| GAIA | Agent QA | 50 | MiroThinker standardized (text-only val) |
| τ-bench | Tool planning | 51 | Full |
| SEAL | Safety evaluation | | Full |
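
A fixed seed makes a 100-example subsample repeatable. The sketch below shows one way to draw such a subsample and checks the run-count arithmetic quoted above. The `sample_indices` helper and the use of Python's `random` module are assumptions; the procedure actually used for the released splits is documented in data/README.md.

```python
import random


def sample_indices(n_total, k=100, seed=42):
    """Draw k distinct example indices reproducibly from range(n_total)."""
    rng = random.Random(seed)  # private RNG, so global random state is untouched
    return sorted(rng.sample(range(n_total), k))


# HumanEval has 164 problems; the same call shape applies to other datasets.
humaneval_subset = sample_indices(164)

# Run count quoted in the protocol: models x paradigms x tasks.
total_runs = 4 * 6 * 761
```

Because the generator is seeded locally, calling `sample_indices(164)` again returns the same indices regardless of any other random calls in the program.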

Data files are included in data/ and also released on Hugging Face: 🤗 henggg/paradigm-bench. See data/README.md for full details on sampling, sources, and task-type coverage.

Citation

@misc{zhou2026selectthensolveparadigmroutinginferencetime,
      title={Select-then-Solve: Paradigm Routing as Inference-Time Optimization for LLM Agents},
      author={Heng Zhou and Zelin Tan and Zhemeng Zhang and Yutao Fan and Yibing Lin and Li Kang and Xiufeng Song and Rui Li and Songtao Huang and Ao Yu and Yuchen Fan and Yanxu Chen and Kaixin Xu and Xiaohong Liu and Yiran Qin and Philip Torr and Chen Zhang and Zhenfei Yin},
      year={2026},
      eprint={2604.06753},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2604.06753},
}

License

MIT License
