Select-then-Solve (STS): Paradigm Routing as Inference-Time Optimization for LLM Agents

Official code for the paper "Select-then-Solve: Paradigm Routing as Inference-Time Optimization for LLM Agents".

Overview

We compare six inference-time reasoning paradigms (Direct, CoT, ReAct, Plan-Execute, Reflection, ReCode) across four frontier LLMs and ten benchmarks (~18k runs), and propose a lightweight embedding-based router that selects the best paradigm per task.

Figure 1: The Select-then-Solve pipeline. A lightweight router selects the best reasoning paradigm before the LLM answers.

Key findings:

Reasoning structure is sharply task-dependent: ReAct gains 44pp on GAIA, while CoT loses 15pp on HumanEval.
No single paradigm dominates: oracle per-task selection beats the best fixed paradigm by 17.1pp.
Our embedding-based router improves average accuracy from 47.6% to 53.1%, recovering up to 37% of the oracle gap.
Zero-shot self-routing pays off only for the strongest model (GPT-5: 67.1%); on the other three models it scores below the embedding router.
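
The routing idea can be illustrated with a toy nearest-centroid classifier: embed each training task, average the embeddings per best-performing paradigm, and route a new task to the paradigm with the most similar centroid. This is a minimal sketch under stated assumptions, not the implementation in src/predictor/. The bag-of-words vector stands in for a real sentence embedding, and the class `ToyEmbeddingRouter`, the training tasks, and their labels are all hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (a stand-in for a real encoder's preprocessing)."""
    return text.lower().split()


class ToyEmbeddingRouter:
    """Nearest-centroid router over toy bag-of-words embeddings (illustrative only)."""

    def fit(self, tasks, best_paradigms):
        # Build a vocabulary from the training tasks.
        vocab = sorted({tok for task in tasks for tok in tokenize(task)})
        self.index = {tok: i for i, tok in enumerate(vocab)}
        # Average the unit-norm embeddings of all tasks sharing a label.
        sums, counts = {}, Counter()
        for task, label in zip(tasks, best_paradigms):
            acc = sums.setdefault(label, [0.0] * len(self.index))
            for i, value in enumerate(self.embed(task)):
                acc[i] += value
            counts[label] += 1
        self.centroids = {
            label: [value / counts[label] for value in acc]
            for label, acc in sums.items()
        }
        return self

    def embed(self, text):
        # Count known tokens, then L2-normalize; unknown tokens are ignored.
        vec = [0.0] * len(self.index)
        for tok in tokenize(text):
            if tok in self.index:
                vec[self.index[tok]] += 1.0
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0
        return [v / norm for v in vec]

    def route(self, task):
        # Pick the paradigm whose centroid has the largest dot product with the query.
        query = self.embed(task)
        return max(
            self.centroids,
            key=lambda label: sum(q * c for q, c in zip(query, self.centroids[label])),
        )


# Hypothetical training set: task text paired with the paradigm that solved it best.
TRAIN_TASKS = [
    "write a python function to reverse a list",
    "implement a function that sorts numbers",
    "search the web for the latest population of tokyo",
    "find which website reports the current price of gold",
]
TRAIN_LABELS = ["direct", "direct", "react", "react"]

router = ToyEmbeddingRouter().fit(TRAIN_TASKS, TRAIN_LABELS)

if __name__ == "__main__":
    print(router.route("write a function that merges two lists"))
    print(router.route("search the web for the current price of silver"))
```

In practice the embedding would come from a pretrained text encoder and the classifier could be any lightweight model; the point of the sketch is only that routing happens once, before the LLM is called.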

Project Structure

STS/
├── src/
│   ├── agent/           # BaseAgent + 6 strategy implementations
│   │   └── strategies/  # direct.py, cot.py, react.py, plan_execute.py, reflection.py, recode.py
│   ├── eval/            # Evaluation: metrics, code_eval, datasets
│   ├── predictor/       # Router: features, models, labeler, task_loader
│   └── tools/           # Web search + code execution tools
├── scripts/
│   ├── run_experiment.py      # Main experiment runner
│   ├── analyze.py             # Result aggregation
│   ├── train_predictor.py     # Train handcrafted-feature router
│   ├── train_router_v2.py     # Train embedding-based router + self-routing
│   └── evaluate_predictor.py  # Evaluate all routing methods
├── configs/
│   ├── paper_colm.yaml        # Config used for paper experiments
│   └── default_template.yaml  # Template config (add your API keys)
├── results/
│   └── aggregated/            # Pre-computed summary and detailed CSVs
└── data/                      # Benchmark data (from MiroThinker)

Quick Start

1. Install dependencies

pip install -r requirements.txt

2. Configure API keys

cp configs/default_template.yaml configs/default.yaml
# Edit configs/default.yaml and add your API keys

3. Run experiments

# Run a single model + paradigm + dataset
PYTHONPATH=. python scripts/run_experiment.py --model gpt-5 --paradigm react --dataset gaia

# Run all experiments from paper config
PYTHONPATH=. python scripts/run_experiment.py --config configs/paper_colm.yaml --parallel --workers 10
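
To sweep several paradigms and datasets for one model, the single-run command above can be generated in a loop. The sketch below only builds the command lines and does not launch them. It relies solely on the `--model`, `--paradigm`, and `--dataset` flags shown above; the paradigm identifiers other than `react` are assumptions inferred from the strategy module filenames, and `build_commands` is a hypothetical helper that is not part of the repository.

```python
import itertools
import shlex

# Assumed identifiers: only "gpt-5", "react", and "gaia" appear in the commands above.
# The remaining paradigm names are guessed from src/agent/strategies/*.py.
PARADIGMS = ["direct", "cot", "react", "plan_execute", "reflection", "recode"]


def build_commands(model, paradigms, datasets):
    """Return one argv list per (paradigm, dataset) pair for run_experiment.py."""
    commands = []
    for paradigm, dataset in itertools.product(paradigms, datasets):
        commands.append([
            "python", "scripts/run_experiment.py",
            "--model", model,
            "--paradigm", paradigm,
            "--dataset", dataset,
        ])
    return commands


if __name__ == "__main__":
    # Print the commands for review. To launch one, run it from the repository
    # root with PYTHONPATH=. set in its environment.
    for argv in build_commands("gpt-5", PARADIGMS, ["gaia"]):
        print("PYTHONPATH=. " + shlex.join(argv))
```

Check the accepted values against the argument parser in scripts/run_experiment.py before running a sweep.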

4. Aggregate results

PYTHONPATH=. python scripts/analyze.py --results-dir results

5. Train and evaluate router

# Train handcrafted-feature predictors
PYTHONPATH=. python scripts/train_predictor.py --config configs/paper_colm.yaml

# Train embedding-based router + run self-routing
PYTHONPATH=. python scripts/train_router_v2.py

# Evaluate all methods
PYTHONPATH=. python scripts/evaluate_predictor.py --config configs/paper_colm.yaml

Paradigms

Paradigm	Control	Tools	Description
Direct	None	No	Free-form answer, no scaffold
CoT	Instructed	No	Step-by-step reasoning
ReAct	Orchestrated	Yes	Thought-action loop with tools
Plan-Execute	Orchestrated	Yes	Plan first, then execute
Reflection	Orchestrated	Yes	Answer, critique, revise
ReCode	Substrate	Yes	Recursive code generation
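
The table above can also be read as a small data structure, which is convenient when filtering paradigms by capability (for example, when no tools are available). This is an illustrative sketch: the `Paradigm` dataclass and the helper functions are hypothetical and do not mirror the classes in src/agent/.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Paradigm:
    name: str
    control: str       # "None", "Instructed", "Orchestrated", or "Substrate"
    uses_tools: bool
    description: str


# Rows copied from the Paradigms table.
PARADIGMS = [
    Paradigm("Direct", "None", False, "Free-form answer, no scaffold"),
    Paradigm("CoT", "Instructed", False, "Step-by-step reasoning"),
    Paradigm("ReAct", "Orchestrated", True, "Thought-action loop with tools"),
    Paradigm("Plan-Execute", "Orchestrated", True, "Plan first, then execute"),
    Paradigm("Reflection", "Orchestrated", True, "Answer, critique, revise"),
    Paradigm("ReCode", "Substrate", True, "Recursive code generation"),
]


def tool_free(paradigms):
    """Names of paradigms that run without tool access."""
    return [p.name for p in paradigms if not p.uses_tools]


def by_control(paradigms):
    """Group paradigm names by their control type, preserving table order."""
    groups = {}
    for p in paradigms:
        groups.setdefault(p.control, []).append(p.name)
    return groups


if __name__ == "__main__":
    print(tool_free(PARADIGMS))
    print(by_control(PARADIGMS))
```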

Results

No Single Paradigm Wins

Figure 2: Direct vs. best paradigm vs. oracle on each dataset (GPT-5). The best paradigm differs for every task type.
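
The difference between the best fixed paradigm and the oracle can be made concrete with per-task outcomes. The sketch below assumes results are stored as one list of pass/fail flags per paradigm, aligned by task; the data is a made-up toy example, not a result from the paper, and the function names are hypothetical.

```python
def accuracy(outcomes):
    """Fraction of tasks solved."""
    return sum(outcomes) / len(outcomes)


def best_single(results):
    """Return (paradigm, accuracy) for the best paradigm applied to every task."""
    name = max(results, key=lambda paradigm: accuracy(results[paradigm]))
    return name, accuracy(results[name])


def oracle(results):
    """Accuracy when each task is routed to any paradigm that solves it."""
    per_task = zip(*results.values())
    return accuracy([any(outcomes) for outcomes in per_task])


# Toy outcomes for four tasks (hypothetical).
TOY_RESULTS = {
    "direct": [True, False, False, True],
    "cot": [True, False, False, False],
    "react": [False, True, False, True],
}

if __name__ == "__main__":
    print(best_single(TOY_RESULTS))
    print(oracle(TOY_RESULTS))
```

The oracle is an upper bound for any router: it counts a task as solved whenever at least one paradigm solves it, so no per-task selection rule can exceed it.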

Router Comparison

Method	GPT-5	Gemini	Qwen3-Max	Qwen3-30B	Avg
Direct	60.3	55.5	49.8	24.9	47.6
Best-single	62.4	55.5	50.7	32.8	50.3
Embedding Router	64.2	61.0	54.6	32.8	53.1
Self-route	67.1	56.8	42.4	27.5	48.4
Oracle	72.9	73.4	72.5	56.8	68.9

Figure 3: The embedding router matches or outperforms Direct and Best-single on every model (it ties Best-single on Qwen3-30B).
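
One way to read the table is to ask what share of the Direct-to-oracle gap the embedding router closes. The sketch below assumes the share is defined as (router - baseline) / (oracle - baseline) with Direct as the baseline. That definition is an assumption: the paper may use a different baseline or compute the share per benchmark, so the printed values need not equal the 37% quoted in the overview.

```python
# Accuracies (%) copied from the Router Comparison table.
TABLE = {
    "GPT-5": {"direct": 60.3, "router": 64.2, "oracle": 72.9},
    "Gemini": {"direct": 55.5, "router": 61.0, "oracle": 73.4},
    "Qwen3-Max": {"direct": 49.8, "router": 54.6, "oracle": 72.5},
    "Qwen3-30B": {"direct": 24.9, "router": 32.8, "oracle": 56.8},
    "Avg": {"direct": 47.6, "router": 53.1, "oracle": 68.9},
}


def gap_recovered(baseline, routed, oracle):
    """Share of the baseline-to-oracle gap closed by the routed accuracy."""
    gap = oracle - baseline
    if gap <= 0:
        raise ValueError("oracle accuracy must exceed the baseline")
    return (routed - baseline) / gap


if __name__ == "__main__":
    for model, row in TABLE.items():
        share = gap_recovered(row["direct"], row["router"], row["oracle"])
        print(f"{model}: {100 * share:.1f}% of the Direct-to-oracle gap")
```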

GPT-5 Success Rate Heatmap

Figure 4: Success rates across all paradigms and datasets. No single row dominates all columns.

Benchmarks

The evaluation data follows a deliberate protocol:

HLE & GAIA: MiroThinker's standardized text-only splits, adopted to match its evaluation setting
AIME, SEAL, τ-bench: used as-is from the original benchmark releases (full sets, no subsampling)
HumanEval, MATH500, HotpotQA, NQ, MMLU: 100 examples each, sampled with seed=42 for reproducibility, because token cost scales as 4 models × 6 paradigms × ~761 tasks ≈ 18k runs

Dataset	Domain	# Used	Protocol
HumanEval	Code generation	100	Sampled (seed=42)
MATH500	Math reasoning	100	Sampled (seed=42)
AIME	Competition math	60	Full (2024 + 2025)
HotpotQA	Multi-hop QA	100	Sampled (seed=42)
Natural Questions	Factoid QA	100	Sampled (seed=42)
MMLU	Multitask knowledge	100	Sampled (seed=42)
HLE	Expert-level QA (Humanity's Last Exam)	500	MiroThinker standardized
GAIA	Agent QA	50	MiroThinker standardized (text-only val)
τ-bench	Tool planning	51	Full
SEAL	Safety evaluation	—	Full
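
Seeded subsampling of the kind listed as "Sampled (seed=42)" can be sketched as follows. Whether the released subsets were drawn with Python's `random` module is an assumption; data/README.md documents the actual procedure. The helper names and the task IDs are hypothetical. The second function only reproduces the run-count arithmetic from the protocol above.

```python
import random


def sample_subset(examples, k=100, seed=42):
    """Draw k examples without replacement using a private, seeded RNG."""
    examples = list(examples)
    if len(examples) <= k:
        return examples
    rng = random.Random(seed)  # private RNG, so global random state is untouched
    return rng.sample(examples, k)


def total_runs(models=4, paradigms=6, tasks=761):
    """Number of (model, paradigm, task) runs in the full grid."""
    return models * paradigms * tasks


if __name__ == "__main__":
    pool = [f"task-{i}" for i in range(500)]  # hypothetical task IDs
    first = sample_subset(pool)
    second = sample_subset(pool)
    print(len(first), first == second)  # same seed, same subset
    print(total_runs())
```

Using a dedicated `random.Random(seed)` instance rather than `random.seed` keeps the draw reproducible even if other code consumes the global generator.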

Data files are included in data/ and also released on Hugging Face: 🤗 henggg/paradigm-bench. See data/README.md for full details on sampling, sources, and task-type coverage.

Citation

@misc{zhou2026selectthensolveparadigmroutinginferencetime,
      title={Select-then-Solve: Paradigm Routing as Inference-Time Optimization for LLM Agents},
      author={Heng Zhou and Zelin Tan and Zhemeng Zhang and Yutao Fan and Yibing Lin and Li Kang and Xiufeng Song and Rui Li and Songtao Huang and Ao Yu and Yuchen Fan and Yanxu Chen and Kaixin Xu and Xiaohong Liu and Yiran Qin and Philip Torr and Chen Zhang and Zhenfei Yin},
      year={2026},
      eprint={2604.06753},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2604.06753},
}

License

MIT License
