MAPGD: Multi-Agent Prompt Gradient Descent for Collaborative Prompt Optimization

Yichen Han¹,*, Yuhang Han²,*, Bojun Liu³, Zhengpeng Zhou⁴, Guanyu Liu⁵, Zeng Zhang¹, Yang Yang⁶, Wenli Wang⁶, Isaac N Shi⁶, Yunyan Zhang⁶, Lewei He¹✉, Tianyu Shi⁷✉

¹South China Normal University ²Shanghai Jiao Tong University ³University of Sydney ⁴Shanghai Jiao Tong University ⁵University of Macau ⁶Silicon Sapiens LLC ⁷University of Toronto

* These authors contributed equally. ✉ Corresponding authors.

🚀 Abstract

Prompt design critically affects the performance of large language models (LLMs). Existing optimization methods often rely on single-agent heuristics, which lack diversity, collaboration, and robustness.
MAPGD introduces a multi-agent framework in which each agent explores prompts from a different perspective and generates textual “gradients”; the agents then collaboratively improve prompts via beam search, semantic fusion, and bandit-based selection.
This approach improves diversity, semantic directionality, and interpretability, offering a scalable and effective solution for real-world prompt engineering.

🧩 Core Features

Multi-Agent Exploration: Agents specialize in instruction clarity, example selection, output format, style, or mathematical reasoning.
Textual Gradients: Agents generate natural-language pseudo-gradients, i.e., written critiques indicating how a prompt should change, analogous to numerical gradients in standard optimization.
Gradient Coordination: HCGC ensures intra-cluster compactness and inter-cluster separation of gradients.
Adaptive Weighting: CAAW dynamically adjusts agent contributions based on historical performance.
Beam Search & Bandit Selection: Efficiently expand candidate prompts and select the best ones.
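
The bandit-based selection named in the last feature can be illustrated with a UCB1-style rule. This is a minimal sketch under stated assumptions, not the repository's implementation: the function name `ucb_select`, the `score_fn` callback, and the exploration constant `c` are hypothetical, and the actual selection rule in MAPGD may differ.

```python
import math
import random


def ucb_select(candidates, score_fn, budget, k, c=1.0, seed=0):
    """Return the k candidates with the highest estimated score.

    candidates: list of prompt strings.
    score_fn(prompt, rng) -> float: hypothetical noisy evaluation of a
        prompt, e.g. accuracy on a random minibatch.
    budget: total number of evaluations to spend across all candidates.
    """
    rng = random.Random(seed)
    n = len(candidates)
    counts = [0] * n    # evaluations spent on each candidate
    means = [0.0] * n   # running mean reward of each candidate

    for t in range(1, budget + 1):
        if t <= n:
            arm = t - 1  # evaluate every candidate once before using UCB
        else:
            # Mean reward plus an exploration bonus that shrinks with use.
            arm = max(
                range(n),
                key=lambda i: means[i] + c * math.sqrt(2 * math.log(t) / counts[i]),
            )
        reward = score_fn(candidates[arm], rng)
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]

    ranked = sorted(range(n), key=lambda i: means[i], reverse=True)
    return [candidates[i] for i in ranked[:k]]
```

The point of a bandit here is that promising prompts receive more evaluations than weak ones, so the evaluation budget is not spent uniformly across the beam.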

⚙️ System Workflow

Input: Initial prompt p0, datasets D_train / D_dev

Iterative Optimization:
  1. Agents generate specialized textual gradients
  2. HCGC clusters and fuses gradients
  3. Prompt expander generates candidate prompts (beam + paraphrasing)
  4. CAAW bandit-based selection chooses top candidates
  5. Agents synchronize with best candidate

Output: Optimized prompt
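
The five steps above can be sketched as a single loop. This is an illustrative outline only: `optimize_prompt` and its callback parameters (`agents`, `fuse`, `expand`, `select`) are hypothetical stand-ins for the repository's agents, HCGC fusion, prompt expander, and CAAW/bandit selection, whose real interfaces are not shown in this README.

```python
from typing import Callable, List


def optimize_prompt(
    p0: str,
    agents: List[Callable[[str], str]],             # prompt -> textual gradient
    fuse: Callable[[List[str]], List[str]],         # stand-in for HCGC clustering + fusion
    expand: Callable[[str, List[str]], List[str]],  # prompt + gradients -> candidate prompts
    select: Callable[[List[str], int], List[str]],  # stand-in for CAAW/bandit selection
    beam_size: int = 2,
    max_iterations: int = 3,
) -> str:
    beam = [p0]
    for _ in range(max_iterations):
        candidates = list(beam)  # keep current prompts in the running
        for prompt in beam:
            gradients = [agent(prompt) for agent in agents]  # step 1
            fused = fuse(gradients)                          # step 2
            candidates.extend(expand(prompt, fused))         # step 3
        beam = select(candidates, beam_size)                 # step 4
        # Step 5: every agent starts the next round from the surviving beam.
    return beam[0]
```

With toy callbacks (for example, `fuse` that deduplicates gradients and `select` that ranks by a dev-set score), the loop can be exercised end to end before wiring in an LLM.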

🚀 How to Start

1. Clone the repository

git clone https://github.com/kawhiiiileo/MAPGD.git
cd MAPGD

2. Run experiments with a specific task

To choose different task sets, use the --task argument. For example:

python experiment_baseline.py --task echo
python experiment_baseline.py --task aqua

Detailed settings live in per-task configuration files in the repository root, e.g., ethos_config.py, aqua_config.py.

You can customize hyperparameters such as beam_size, max_iterations, or CAAW lambda directly in the config.
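
As a rough illustration of what such a configuration might hold, the sketch below groups the three hyperparameters mentioned above. The `TaskConfig` class, the field name `caaw_lambda`, and all default values are assumptions made for this example; check the actual config files for the real names and values.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class TaskConfig:
    """Hypothetical container; the repository's config files may be laid out differently."""

    beam_size: int = 4          # illustrative value: prompts kept per iteration
    max_iterations: int = 6     # illustrative value: optimization rounds
    caaw_lambda: float = 0.5    # assumed name for the "CAAW lambda" setting

    def __post_init__(self) -> None:
        # Fail early on values that would make the search degenerate.
        if self.beam_size < 1:
            raise ValueError("beam_size must be at least 1")
        if self.max_iterations < 1:
            raise ValueError("max_iterations must be at least 1")


base = TaskConfig()
wide = replace(base, beam_size=8)  # derive a variant without mutating the base
```

Keeping the configuration immutable and deriving variants with `replace` makes it easier to record exactly which settings produced a given run.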

📊 Dataset

1. ECHO Dataset

Task	Dataset Description	Reference
ECHO	English online hate speech detection dataset, containing 997 annotated online comments.	Ioannis Mollas, Zoe Chrysopoulou, Stamatis Karlos, and Grigorios Tsoumakas. Ethos: An online hate speech detection dataset. arXiv preprint arXiv:2006.08328, 2020.
AQUA	Algebraic word problems for program induction and step-by-step reasoning.	Wang Ling, Dani Yogatama, Chris Dyer, and Phil Blunsom. Program induction by rationale generation: Learning to solve and explain algebraic word problems. arXiv preprint arXiv:1705.04146, 2017.
GSM8K	Grade-school math problems requiring multi-step reasoning; a widely used benchmark.	Karl Cobbe et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
SVAMP	Simple arithmetic word problems with linguistic variations, testing robustness to paraphrasing.	Arkil Patel, Satwik Bhattamishra, and Navin Goyal. Are NLP models really able to solve simple math word problems? Proc. of NAACL-HLT, 2021.
LIAR	Short statements labeled with ground-truth veracity, used for fake news detection.	William Y. Wang. “Liar, Liar Pants on Fire”: A new benchmark dataset for fake news detection. ACL, 2017.
Jailbreak	Multilingual prompts targeting jailbreak detection for LLMs, containing 1,306 examples.	Xinyue Shen, Zeyuan Chen, Michael Backes, Yun Shen, and Yang Zhang. “Do Anything Now”: Characterizing and evaluating in-the-wild jailbreak prompts on large language models. Proc. of ACM CCS, 2024, 1671–1685.
Ethos	English hate speech detection dataset, used for benchmarking multi-agent prompt optimization.	Vidgen, Bertie et al. Learning to detect harmful online content. arXiv preprint arXiv:2004.08617, 2020.
Sarcasm	Arabic sarcasm detection dataset with 10,000 online comments labeled for presence/absence of sarcasm.	Ibrahim Abu Farha and Walid Magdy. From Arabic sentiment analysis to sarcasm detection: The ArSarcasm dataset. The 4th Workshop on Open-Source Arabic Corpora and Processing Tools, ELRA, 2020, 32–39.

🔧 Notes

MAPGD supports both text classification and mathematical reasoning tasks.

Multi-agent collaboration is enabled by default, and HCGC+CAAW fusion ensures semantic consistency and adaptive weighting.
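
To make "adaptive weighting" concrete, one simple scheme is a softmax over each agent's average past gain. This is a sketch of the general idea only, not the CAAW formula: the function name `adaptive_weights`, the `history` structure, and the `temperature` parameter are assumptions for illustration.

```python
import math


def adaptive_weights(history, temperature=1.0):
    """Map each agent's past validation gains to a weight; weights sum to 1.

    history: dict of agent name -> list of past gains (hypothetical structure).
    temperature: lower values concentrate weight on the best agent.
    """
    if not history:
        return {}
    # Agents with no history yet are treated as having zero average gain.
    means = {a: (sum(h) / len(h) if h else 0.0) for a, h in history.items()}
    top = max(means.values())
    # Subtract the maximum before exponentiating for numerical stability.
    exps = {a: math.exp((m - top) / temperature) for a, m in means.items()}
    total = sum(exps.values())
    return {a: e / total for a, e in exps.items()}
```

Agents whose gradients have helped more in the past receive a larger share of the fused update, while weaker agents keep a nonzero weight and can recover.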

The framework allows easy integration of new agents or tasks by extending SpecializedPromptAgent and updating TASK_AGENT_MAPPING.
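
The extension pattern might look roughly like the following. Everything here is a stand-in written for this example: the base class is redefined locally because its real constructor and methods are not documented in this README, `build_gradient_request` and `MultilingualAgent` are hypothetical names, and the shape of TASK_AGENT_MAPPING (task name to list of agent classes) is an assumption.

```python
class SpecializedPromptAgent:
    """Local stand-in for the repository's base class; the real interface may differ."""

    def __init__(self, name: str, focus: str) -> None:
        self.name = name
        self.focus = focus

    def build_gradient_request(self, prompt: str, errors: list[str]) -> str:
        # Compose the text an LLM would receive when asked for a textual gradient.
        shown = "\n".join(f"- {e}" for e in errors)
        return (
            f"You are the {self.name} agent. Focus: {self.focus}.\n"
            f"Current prompt:\n{prompt}\n"
            f"Failed examples:\n{shown}\n"
            "Explain how the prompt should change."
        )


class MultilingualAgent(SpecializedPromptAgent):
    """Hypothetical new agent that critiques handling of non-English inputs."""

    def __init__(self) -> None:
        super().__init__(
            name="multilingual",
            focus="handling non-English inputs consistently",
        )


# Assumed shape: task name -> agent classes enabled for that task.
TASK_AGENT_MAPPING: dict[str, list[type[SpecializedPromptAgent]]] = {"jailbreak": []}
TASK_AGENT_MAPPING["jailbreak"].append(MultilingualAgent)
```

In the real codebase, the subclass would import SpecializedPromptAgent from the repository rather than redefine it, and the mapping entry would be edited where TASK_AGENT_MAPPING is declared.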

📖 Citation

If you find our work useful, please consider citing:

@misc{han2025mapgdmultiagentpromptgradient,
      title={MAPGD: Multi-Agent Prompt Gradient Descent for Collaborative Prompt Optimization}, 
      author={Yichen Han and Yuhang Han and Bojun Liu and Zhengpeng Zhou and Guanyu Liu and Zeng Zhang and Yang Yang and Wenli Wang and Isaac N Shi and Yunyan Zhang and Lewei He and Tianyu Shi},
      year={2025},
      eprint={2509.11361},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2509.11361}, 
}

