This repository contains the implementation and evaluation code for our paper "Revisiting Multi-Agent Debate as Test-Time Scaling: A Systematic Study of Conditional Effectiveness", which compares the multi-agent debate (MAD) framework with single-agent test-time scaling in the following two domains:
- Mathematical Reasoning
- Safety
We recommend using conda for dependency management:
conda create -n mad python=3.11 -y
conda activate mad
pip install -r requirements.txt
⚠️ If you want to use OpenAI's API for evaluation, you must downgrade the OpenAI Python package:
pip install openai==0.28.0
To run your own API server for model inference:
bash open_server.sh
Refer to open_server.sh for details on model loading and configuration.
The codebase includes inference scripts for two domains:
Mathematical Reasoning:
- Single-agent inference:
bash scripts/generation/run_math_single.sh
- Multi-agent debate:
bash scripts/generation/run_math_multi.sh
Key parameters:
- Dataset selection (gsm8k, math500, aime2024, aime2025)
- Number of agents and voting mechanism
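To make the "voting mechanism" parameter concrete, here is a minimal sketch of majority voting over the final answers of several agents. It is an illustration only: the function name `majority_vote`, the whitespace normalization, and the tie-breaking rule are assumptions, not the implementation used by the scripts in this repository.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_vote(answers: Sequence[Optional[str]]) -> Optional[str]:
    """Return the most frequent non-empty answer, or None if there is none.

    Assumption: ties go to the answer that appeared first, which is how
    Counter.most_common orders elements with equal counts.
    """
    # Strip whitespace so "42" and " 42 " count as the same answer.
    cleaned = [a.strip() for a in answers if a is not None and a.strip()]
    if not cleaned:
        return None
    return Counter(cleaned).most_common(1)[0][0]


if __name__ == "__main__":
    # Three hypothetical agents answer the same question.
    print(majority_vote(["18", "18", "20"]))  # prints 18
```

Real answers usually need stronger normalization (for example, treating `0.5` and `1/2` as equal) before counting; that step is left out here.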
Safety:
- Single-agent self-refinement:
bash scripts/generation/run_safety_single.sh
- Multi-agent debate:
bash scripts/generation/run_safety_multi.sh
Configurable options:
- Dataset selection (safetybench, advbench)
- Scaling method (self-refinement, best-of-N)
- Persona and judge model settings
Note that the current version of safety reasoning does not support direct use with vLLM, so you need to use open_server.sh.
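The two single-agent scaling methods listed above can be sketched in a few lines. This is a hedged illustration of the general techniques, not the repository's code: `best_of_n`, `self_refine`, and the `score` and `refine` callables are hypothetical names standing in for a judge model and a model call.

```python
from typing import Callable, Sequence


def best_of_n(candidates: Sequence[str], score: Callable[[str], float]) -> str:
    """Best-of-N: return the highest-scoring response among N sampled ones."""
    if not candidates:
        raise ValueError("best_of_n needs at least one candidate")
    return max(candidates, key=score)


def self_refine(draft: str, refine: Callable[[str], str], rounds: int) -> str:
    """Self-refinement: feed the current answer back to the model `rounds` times."""
    for _ in range(rounds):
        draft = refine(draft)
    return draft


if __name__ == "__main__":
    # Toy stand-ins for a judge model and a refinement call (both hypothetical).
    toy_score = len
    toy_refine = lambda text: text + "!"
    print(best_of_n(["ok", "better", "best one"], toy_score))
    print(self_refine("draft", toy_refine, rounds=2))
```

Best-of-N spends its compute on parallel samples plus a scorer, while self-refinement spends it on sequential calls, which is why the two are configured as separate scaling methods.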
Single-agent evaluation:
bash scripts/eval/eval_math_self.sh DATASET_NAME
Multi-agent evaluation:
bash scripts/eval/eval_math_multi.sh DATASET_NAME
Run all evaluations across all datasets:
bash scripts/eval/evaluate_all.sh
Safety evaluation:
bash scripts/eval/eval_safety_api.sh # API-based evaluation
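To evaluate only a subset of the math datasets, the per-dataset scripts can be driven from a small wrapper. The sketch below only builds and prints the commands. It assumes the evaluation scripts take the dataset name as their single positional argument (as in the commands above) and accept the same dataset names as the inference scripts; `build_eval_commands` and `MATH_DATASETS` are hypothetical helpers, not part of the repository.

```python
import shlex
from typing import List, Sequence

# Dataset names as listed for math inference; assumed to be valid for evaluation too.
MATH_DATASETS = ("gsm8k", "math500", "aime2024", "aime2025")


def build_eval_commands(datasets: Sequence[str], multi_agent: bool = False) -> List[List[str]]:
    """Build one `bash <script> <dataset>` command per dataset."""
    script = (
        "scripts/eval/eval_math_multi.sh"
        if multi_agent
        else "scripts/eval/eval_math_self.sh"
    )
    return [["bash", script, name] for name in datasets]


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    for cmd in build_eval_commands(MATH_DATASETS):
        print(shlex.join(cmd))
        # To execute from the repository root: subprocess.run(cmd, check=True)
```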