euiin/MAD_as_TTS

🧠 Revisiting Multi-Agent Debate as Test-Time Scaling

This repository contains the implementation and evaluation code for our paper "Revisiting Multi-Agent Debate as Test-Time Scaling: A Systematic Study of Conditional Effectiveness", which compares multi-agent debate (MAD) with single-agent test-time scaling methods in two domains:

  • Mathematical Reasoning
  • Safety

Environment Setup

We recommend using conda for dependency management:

conda create -n mad python=3.11 -y
conda activate mad
pip install -r requirements.txt

⚠️ If you want to use OpenAI's API for evaluation, you must downgrade the OpenAI Python package:

pip install openai==0.28.0
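Before running API-based evaluation, it may help to confirm that the pinned version is the one actually installed. The following is a minimal sketch using only the standard library; the helper names (openai_version_status, installed_openai_version) are hypothetical and not part of this repository.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional

REQUIRED_OPENAI = "0.28.0"  # version pinned in the install command above


def openai_version_status(installed: Optional[str], required: str = REQUIRED_OPENAI) -> str:
    """Classify an installed version string as 'ok', 'missing', or 'mismatch'."""
    if installed is None:
        return "missing"
    return "ok" if installed == required else "mismatch"


def installed_openai_version() -> Optional[str]:
    """Return the installed openai package version, or None if it is not installed."""
    try:
        return version("openai")
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    print("openai package status:", openai_version_status(installed_openai_version()))
```

A status of "mismatch" or "missing" suggests re-running the pip command above inside the mad environment.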

Model Server Setup (Optional)

To run your own API server for model inference:

bash open_server.sh

Refer to open_server.sh for details on model loading and configuration.

Inference Scripts

The codebase includes inference scripts for two domains:

Mathematical Reasoning

  • Single-agent inference:
bash scripts/generation/run_math_single.sh
  • Multi-agent debate:
bash scripts/generation/run_math_multi.sh

Key parameters:

  • Dataset selection (gsm8k, math500, aime2024, aime2025)
  • Number of agents and voting mechanism
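To illustrate how the number of agents and a voting mechanism interact, here is a generic sketch of a debate loop followed by majority voting. It is not the implementation in this repository: the Agent signature, run_debate, and majority_vote are hypothetical names, and the tie-breaking rule (first answer seen wins) is an assumption made for the example.

```python
from collections import Counter
from typing import Callable, List, Optional, Sequence

# Hypothetical agent signature: (question, peer_answers) -> answer string.
Agent = Callable[[str, Sequence[str]], str]


def majority_vote(answers: Sequence[Optional[str]]) -> Optional[str]:
    """Return the most common non-empty answer; ties go to the answer seen first."""
    cleaned = [a.strip() for a in answers if a is not None and a.strip()]
    if not cleaned:
        return None
    counts = Counter(cleaned)
    best = max(counts.values())
    # Walk the answers in their original order so ties break deterministically.
    for answer in cleaned:
        if counts[answer] == best:
            return answer
    return None


def run_debate(question: str, agents: Sequence[Agent], rounds: int = 2) -> List[str]:
    """Collect independent answers, then let each agent revise after seeing its peers."""
    # Round 0: every agent answers without seeing anyone else.
    answers = [agent(question, []) for agent in agents]
    for _ in range(rounds):
        # Each agent sees the other agents' latest answers, not its own.
        answers = [
            agent(question, answers[:i] + answers[i + 1:])
            for i, agent in enumerate(agents)
        ]
    return answers
```

In practice each agent would wrap a model call; the final prediction is then majority_vote(run_debate(question, agents)).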

Safety Reasoning

  • Single-agent self-refinement:
bash scripts/generation/run_safety_single.sh
  • Multi-agent debate:
bash scripts/generation/run_safety_multi.sh

Configurable options:

  • Dataset selection (safetybench, advbench)
  • Scaling method (self-refinement, best-of-N)
  • Persona and judge model settings
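The two scaling methods listed above can be sketched generically as follows. This is an illustrative outline under stated assumptions, not the code in this repository: generate, refine, and score are hypothetical callables standing in for a model call, a revision prompt, and a judge model respectively.

```python
from typing import Callable


def best_of_n(
    prompt: str,
    generate: Callable[[str], str],
    score: Callable[[str], float],
    n: int = 4,
) -> str:
    """Sample n candidate responses and return the one with the highest score."""
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=score)


def self_refine(
    prompt: str,
    generate: Callable[[str], str],
    refine: Callable[[str, str], str],
    rounds: int = 2,
) -> str:
    """Produce an initial response, then revise it for a fixed number of rounds."""
    response = generate(prompt)
    for _ in range(rounds):
        # refine receives the original prompt and the current draft.
        response = refine(prompt, response)
    return response
```

Best-of-N spends its extra compute on parallel samples, while self-refinement spends it on sequential revisions of a single response.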

Note: the safety inference scripts do not currently support direct use with vLLM, so you need to start a model server with open_server.sh first.

Evaluation Scripts

Mathematical Reasoning Evaluation

Single-agent evaluation:

bash scripts/eval/eval_math_self.sh DATASET_NAME

Multi-agent evaluation:

bash scripts/eval/eval_math_multi.sh DATASET_NAME

Run all evaluations across all datasets:

bash scripts/eval/evaluate_all.sh
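To evaluate only a subset of datasets from Python, the per-dataset commands can be assembled as below. This is a hedged sketch: the script paths come from this README, but whether the evaluation scripts accept exactly the dataset names listed under the inference parameters is an assumption, and build_eval_commands is a hypothetical helper.

```python
import subprocess
from typing import List, Sequence

# Dataset names as listed under "Key parameters"; assumed valid for DATASET_NAME.
MATH_DATASETS = ("gsm8k", "math500", "aime2024", "aime2025")
EVAL_SCRIPTS = (
    "scripts/eval/eval_math_self.sh",
    "scripts/eval/eval_math_multi.sh",
)


def build_eval_commands(datasets: Sequence[str] = MATH_DATASETS) -> List[List[str]]:
    """Return one bash command per (dataset, script) pair, without running anything."""
    return [["bash", script, name] for name in datasets for script in EVAL_SCRIPTS]


def run_eval_commands(commands: Sequence[Sequence[str]]) -> None:
    """Run each command from the repository root, stopping at the first failure."""
    for command in commands:
        subprocess.run(list(command), check=True)
```

For example, run_eval_commands(build_eval_commands(["gsm8k"])) would run the single-agent and multi-agent evaluations for one dataset.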

Safety Evaluation

bash scripts/eval/eval_safety_api.sh  # API-based evaluation

About

Our paper seeks a systematic understanding of MAD's effectiveness compared to self-agent methods, by conceptualizing MAD as a test-time computational scaling technique, distinguished by collaborative refinement and diverse exploration capabilities.
