Official code for the paper "Select-then-Solve: Paradigm Routing as Inference-Time Optimization for LLM Agents".
We compare six inference-time reasoning paradigms (Direct, CoT, ReAct, Plan-Execute, Reflection, ReCode) across four frontier LLMs and ten benchmarks (~18k runs), and propose a lightweight embedding-based router that selects the best paradigm per task.
Figure 1: The Select-then-Solve pipeline. A lightweight router selects the best reasoning paradigm before the LLM answers.
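To make the pipeline concrete, here is a minimal sketch of select-then-solve routing. It is not the repository's implementation: the hashed bag-of-words `embed` function, the nearest-centroid `CentroidRouter`, the lowercase paradigm labels, and the toy training tasks are all assumptions standing in for the real embedding model and router under `src/predictor/`.

```python
import hashlib
import math
from collections import defaultdict

DIM = 256  # assumed embedding size for this toy example


def embed(text):
    """Toy stand-in for a sentence embedding: hashed bag of words, L2-normalized."""
    vec = [0.0] * DIM
    for token in text.lower().split():
        # md5 gives a bucket index that is stable across Python processes
        idx = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) % DIM
        vec[idx] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else vec


class CentroidRouter:
    """Routes a task to the paradigm whose training centroid is most similar."""

    def fit(self, tasks, best_paradigms):
        groups = defaultdict(list)
        for task, paradigm in zip(tasks, best_paradigms):
            groups[paradigm].append(embed(task))
        # One centroid per paradigm: element-wise mean of its task embeddings
        self.centroids = {
            paradigm: [sum(col) / len(vecs) for col in zip(*vecs)]
            for paradigm, vecs in groups.items()
        }
        return self

    def select(self, task):
        query = embed(task)

        def score(paradigm):
            # Dot-product similarity between the query and a paradigm centroid
            return sum(c * q for c, q in zip(self.centroids[paradigm], query))

        return max(self.centroids, key=score)


# Hypothetical training data: task text paired with the paradigm that solved it best.
train_tasks = [
    "write a python function that reverses a list",
    "implement a function to check if a number is prime",
    "search the web for the population of the capital of France",
    "find the latest release date using web search",
]
train_labels = ["direct", "direct", "react", "react"]

router = CentroidRouter().fit(train_tasks, train_labels)
code_choice = router.select("write a python function that sorts a list")
search_choice = router.select("search the web for the capital of Japan")
print(code_choice, search_choice)
```

In this sketch the selected label would then be used to dispatch the task to the matching strategy before the LLM is called; the router itself never calls the LLM.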
Key findings:
- Reasoning structure is sharply task-dependent: ReAct gains 44pp on GAIA, while CoT loses 15pp on HumanEval
- No single paradigm dominates; oracle per-task selection beats the best fixed paradigm by 17.1pp
- Our embedding-based router improves average accuracy from 47.6% to 53.1%, recovering up to 37% of the oracle gap
- Zero-shot self-routing works only for the strongest model (GPT-5: 67.1%); weaker models fail
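The difference between the best fixed paradigm and oracle per-task selection can be illustrated with a small worked example. The outcome matrix below is made up for illustration and is not taken from the paper's results; the function names are likewise hypothetical.

```python
# Toy outcomes: outcomes[paradigm][i] is 1 if that paradigm solved task i, else 0.
outcomes = {
    "direct": [1, 0, 0, 1, 0],
    "cot":    [1, 1, 0, 0, 0],
    "react":  [0, 0, 1, 1, 0],
}


def best_fixed_accuracy(outcomes):
    """Accuracy of the single paradigm with the highest overall success rate."""
    return max(sum(row) / len(row) for row in outcomes.values())


def oracle_accuracy(outcomes):
    """Accuracy when the best paradigm is picked separately for every task."""
    per_task = zip(*outcomes.values())  # one tuple of outcomes per task
    solved = [max(task) for task in per_task]
    return sum(solved) / len(solved)


best_fixed = best_fixed_accuracy(outcomes)
oracle = oracle_accuracy(outcomes)
gap_pp = 100 * (oracle - best_fixed)
print(f"best fixed: {best_fixed:.0%}, oracle: {oracle:.0%}, gap: {gap_pp:.0f}pp")
```

Every paradigm here solves the same number of tasks, yet they solve different ones, which is why the oracle can beat any fixed choice. A router tries to close part of that gap without seeing the outcomes in advance.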
STS/
├── src/
│ ├── agent/ # BaseAgent + 6 strategy implementations
│ │ └── strategies/ # direct.py, cot.py, react.py, plan_execute.py, reflection.py, recode.py
│ ├── eval/ # Evaluation: metrics, code_eval, datasets
│ ├── predictor/ # Router: features, models, labeler, task_loader
│ └── tools/ # Web search + code execution tools
├── scripts/
│ ├── run_experiment.py # Main experiment runner
│ ├── analyze.py # Result aggregation
│ ├── train_predictor.py # Train handcrafted-feature router
│ ├── train_router_v2.py # Train embedding-based router + self-routing
│ └── evaluate_predictor.py # Evaluate all routing methods
├── configs/
│ ├── paper_colm.yaml # Config used for paper experiments
│ └── default_template.yaml # Template config (add your API keys)
├── results/
│ └── aggregated/ # Pre-computed summary and detailed CSVs
└── data/ # Benchmark data (from MiroThinker)
pip install -r requirements.txt
cp configs/default_template.yaml configs/default.yaml
# Edit configs/default.yaml and add your API keys
# Run a single model + paradigm + dataset
PYTHONPATH=. python scripts/run_experiment.py --model gpt-5 --paradigm react --dataset gaia
# Run all experiments from paper config
PYTHONPATH=. python scripts/run_experiment.py --config configs/paper_colm.yaml --parallel --workers 10
PYTHONPATH=. python scripts/analyze.py --results-dir results
# Train handcrafted-feature predictors
PYTHONPATH=. python scripts/train_predictor.py --config configs/paper_colm.yaml
# Train embedding-based router + run self-routing
PYTHONPATH=. python scripts/train_router_v2.py
# Evaluate all methods
PYTHONPATH=. python scripts/evaluate_predictor.py --config configs/paper_colm.yaml

| Paradigm | Control | Tools | Description |
|---|---|---|---|
| Direct | None | No | Free-form answer, no scaffold |
| CoT | Instructed | No | Step-by-step reasoning |
| ReAct | Orchestrated | Yes | Thought-action loop with tools |
| Plan-Execute | Orchestrated | Yes | Plan first, then execute |
| Reflection | Orchestrated | Yes | Answer, critique, revise |
| ReCode | Substrate | Yes | Recursive code generation |
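The table above can also be written down as data, which is convenient when filtering paradigms by their properties. The `Paradigm` dataclass and helper functions below are hypothetical and are not the repository's `BaseAgent` or strategy classes; only the field values come from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Paradigm:
    name: str
    control: str      # "None", "Instructed", "Orchestrated", or "Substrate"
    uses_tools: bool
    description: str


# Values copied row by row from the paradigm table.
PARADIGMS = [
    Paradigm("Direct", "None", False, "Free-form answer, no scaffold"),
    Paradigm("CoT", "Instructed", False, "Step-by-step reasoning"),
    Paradigm("ReAct", "Orchestrated", True, "Thought-action loop with tools"),
    Paradigm("Plan-Execute", "Orchestrated", True, "Plan first, then execute"),
    Paradigm("Reflection", "Orchestrated", True, "Answer, critique, revise"),
    Paradigm("ReCode", "Substrate", True, "Recursive code generation"),
]


def tool_using(paradigms):
    """Names of paradigms that have access to tools."""
    return [p.name for p in paradigms if p.uses_tools]


def by_control(paradigms, control):
    """Names of paradigms with the given control type."""
    return [p.name for p in paradigms if p.control == control]


print(tool_using(PARADIGMS))
print(by_control(PARADIGMS, "Orchestrated"))
```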
Figure 2: Direct vs. best paradigm vs. oracle on each dataset (GPT-5). The best paradigm differs for every task type.
| Method | GPT-5 | Gemini | Qwen3-Max | Qwen3-30B | Avg |
|---|---|---|---|---|---|
| Direct | 60.3 | 55.5 | 49.8 | 24.9 | 47.6 |
| Best-single | 62.4 | 55.5 | 50.7 | 32.8 | 50.3 |
| Embedding Router | 64.2 | 61.0 | 54.6 | 32.8 | 53.1 |
| Self-route | 67.1 | 56.8 | 42.4 | 27.5 | 48.4 |
| Oracle | 72.9 | 73.4 | 72.5 | 56.8 | 68.9 |
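As a quick sanity check on the table, the following sketch recomputes per-method averages and the router's gain over Direct for each model. The numbers are copied from the table; the variable and function names are illustrative. Averages recomputed from the rounded per-model values can differ from the reported Avg column in the last digit, presumably because the reported averages were computed before rounding.

```python
# Accuracy (%) per model, copied from the results table.
MODELS = ["GPT-5", "Gemini", "Qwen3-Max", "Qwen3-30B"]
RESULTS = {
    "Direct":           [60.3, 55.5, 49.8, 24.9],
    "Best-single":      [62.4, 55.5, 50.7, 32.8],
    "Embedding Router": [64.2, 61.0, 54.6, 32.8],
    "Self-route":       [67.1, 56.8, 42.4, 27.5],
    "Oracle":           [72.9, 73.4, 72.5, 56.8],
}


def mean(values):
    return sum(values) / len(values)


averages = {method: mean(scores) for method, scores in RESULTS.items()}
for method, avg in averages.items():
    print(f"{method:<17} avg = {avg:.2f}")

# Per-model gain of the embedding router over Direct, in percentage points.
router_gain = {
    model: round(router - direct, 1)
    for model, router, direct in zip(
        MODELS, RESULTS["Embedding Router"], RESULTS["Direct"]
    )
}
print(router_gain)
```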
Figure 3: The embedding router matches or outperforms Direct and Best-single on every model.
Figure 4: Success rates across all paradigms and datasets. No single row dominates all columns.
The evaluation data follows a deliberate protocol:
- HLE & GAIA: adopt MiroThinker's standardized text-only splits to match their evaluation setting
- AIME, SEAL, τ-bench: used as-is from the original benchmark releases (full sets, no subsampling)
- HumanEval, MATH500, HotpotQA, NQ, MMLU: 100 examples sampled from each with seed=42 for reproducibility, due to token-cost constraints across 4 models × 6 paradigms × ~761 tasks = ~18k runs
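Seeded subsampling of this kind can be sketched as follows. This is an assumption about the general approach, not the repository's actual sampling code (which may use a different random number generator or shuffle order); `sample_subset` and the placeholder task pool are hypothetical.

```python
import random


def sample_subset(examples, k=100, seed=42):
    """Draw a reproducible subset of k examples (all of them if fewer exist)."""
    rng = random.Random(seed)  # local RNG, leaves the global random state untouched
    if len(examples) <= k:
        return list(examples)
    return rng.sample(examples, k)


# Hypothetical stand-in for a benchmark with 500 items.
pool = [f"task-{i}" for i in range(500)]
subset = sample_subset(pool)
print(len(subset), subset == sample_subset(pool))

# Run count implied by the protocol: models x paradigms x tasks.
total_runs = 4 * 6 * 761
print(total_runs)
```

Using a dedicated `random.Random(seed)` instance rather than `random.seed` keeps the draw reproducible even if other code consumes random numbers in between.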
| Dataset | Domain | # Used | Protocol |
|---|---|---|---|
| HumanEval | Code generation | 100 | Sampled (seed=42) |
| MATH500 | Math reasoning | 100 | Sampled (seed=42) |
| AIME | Competition math | 60 | Full (2024 + 2025) |
| HotpotQA | Multi-hop QA | 100 | Sampled (seed=42) |
| Natural Questions | Factoid QA | 100 | Sampled (seed=42) |
| MMLU | Multitask knowledge | 100 | Sampled (seed=42) |
| HLE | Expert-level exam (Humanity's Last Exam) | 500 | MiroThinker standardized |
| GAIA | Agent QA | 50 | MiroThinker standardized (text-only val) |
| τ-bench | Tool planning | 51 | Full |
| SEAL | Safety evaluation | — | Full |
Data files are included in data/ and also released on Hugging Face: 🤗 henggg/paradigm-bench. See data/README.md for full details on sampling, sources, and task-type coverage.
@misc{zhou2026selectthensolveparadigmroutinginferencetime,
title={Select-then-Solve: Paradigm Routing as Inference-Time Optimization for LLM Agents},
author={Heng Zhou and Zelin Tan and Zhemeng Zhang and Yutao Fan and Yibing Lin and Li Kang and Xiufeng Song and Rui Li and Songtao Huang and Ao Yu and Yuchen Fan and Yanxu Chen and Kaixin Xu and Xiaohong Liu and Yiran Qin and Philip Torr and Chen Zhang and Zhenfei Yin},
year={2026},
eprint={2604.06753},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2604.06753},
}

MIT License