Yichen Han1,*, Yuhang Han2,*, Bojun Liu3, Zhengpeng Zhou4, Guanyu Liu5, Zeng Zhang1, Yang Yang6, Wenli Wang6, Isaac N Shi6, Yunyan Zhang6, Lewei He1✉, Tianyu Shi7✉
1South China Normal University 2Shanghai Jiao Tong University 3University of Sydney 4Shanghai Jiao Tong University 5University of Macau 6Silicon Sapiens LLC 7University of Toronto
* These authors contributed equally. ✉ Corresponding authors.
Prompt design critically affects the performance of large language models (LLMs). Existing optimization methods often rely on single-agent heuristics, which lack diversity, collaboration, and robustness.
MAPGD introduces a multi-agent framework where each agent explores prompts from different perspectives, generates textual “gradients,” and collaboratively improves prompts via beam search, semantic fusion, and bandit-based selection.
This approach improves diversity, semantic directionality, and interpretability, offering a scalable solution for real-world prompt engineering.
- Multi-Agent Exploration: Agents specialize in instruction clarity, example selection, output format, style, or mathematical reasoning.
- Textual Gradients: Agents generate natural language pseudo-gradients akin to numerical gradients.
- Gradient Coordination: HCGC ensures intra-cluster compactness and inter-cluster separation of gradients.
- Adaptive Weighting: CAAW dynamically adjusts agent contributions based on historical performance.
- Beam Search & Bandit Selection: Efficiently expand candidate prompts and select the best ones.
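
To make the bandit selection idea concrete, here is a minimal sketch of a generic UCB1 selector over candidate prompts. It is not taken from the MAPGD code base: the function name `ucb_select`, its parameters, and the one-example-per-pull reward scheme are all assumptions made for illustration, and the repository's selection logic may differ.

```python
import math
import random
from typing import Callable, List, Sequence


def ucb_select(
    candidates: Sequence[str],
    score_fn: Callable[[str, object], float],
    examples: Sequence[object],
    top_k: int = 2,
    rounds: int = 50,
    c: float = 1.0,
    seed: int = 0,
) -> List[str]:
    """Return the top_k candidates by mean reward after `rounds` UCB1 pulls.

    Hypothetical helper: `score_fn(prompt, example)` is assumed to return a
    reward in [0, 1], e.g. 1.0 when the model answers the example correctly.
    """
    rng = random.Random(seed)
    counts = [0] * len(candidates)
    totals = [0.0] * len(candidates)

    for t in range(1, rounds + 1):
        # Pull every arm once first, then pick the highest upper confidence bound.
        untried = [i for i, n in enumerate(counts) if n == 0]
        if untried:
            arm = untried[0]
        else:
            arm = max(
                range(len(candidates)),
                key=lambda i: totals[i] / counts[i]
                + c * math.sqrt(2 * math.log(t) / counts[i]),
            )
        # One pull scores the chosen prompt on a single randomly drawn example.
        reward = score_fn(candidates[arm], rng.choice(examples))
        counts[arm] += 1
        totals[arm] += reward

    # Rank by empirical mean; arms that were never pulled get a mean of 0.
    means = [totals[i] / counts[i] if counts[i] else 0.0 for i in range(len(candidates))]
    ranked = sorted(range(len(candidates)), key=lambda i: means[i], reverse=True)
    return [candidates[i] for i in ranked[:top_k]]
```

The point of a bandit here is budget: promising prompts receive more evaluations, while weak ones are dropped after only a few pulls instead of being scored on the full development set.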
Input: Initial prompt p0, datasets D_train / D_dev
Iterative Optimization:
1. Agents generate specialized textual gradients
2. HCGC clusters and fuses gradients
3. Prompt expander generates candidate prompts (beam + paraphrasing)
4. CAAW bandit-based selection chooses top candidates
5. Agents synchronize with best candidate
Output: Optimized prompt
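
The five steps above can be sketched as a single loop. This is an illustrative skeleton only: `optimize_prompt` and the callables `agents`, `fuse`, `expand`, and `select` are hypothetical stand-ins for the agents, HCGC, the prompt expander, and CAAW selection, not the repository's actual API.

```python
from typing import Callable, List, Sequence

Prompt = str
Gradient = str


def optimize_prompt(
    p0: Prompt,
    agents: Sequence[Callable[[Prompt], Gradient]],
    fuse: Callable[[List[Gradient]], List[Gradient]],
    expand: Callable[[Prompt, List[Gradient]], List[Prompt]],
    select: Callable[[List[Prompt]], List[Prompt]],
    max_iterations: int = 3,
) -> Prompt:
    """Run the iterative loop and return the best prompt found.

    Assumption: `select` returns a non-empty list ranked best-first.
    """
    beam = [p0]
    for _ in range(max_iterations):
        candidates = list(beam)  # keep current prompts so quality never regresses
        for prompt in beam:
            # Step 1: each agent critiques the prompt from its own perspective.
            gradients = [agent(prompt) for agent in agents]
            # Step 2: cluster and fuse the critiques (stand-in for HCGC).
            fused = fuse(gradients)
            # Step 3: rewrite the prompt along each fused gradient.
            candidates.extend(expand(prompt, fused))
        # Step 4: keep only the top candidates (stand-in for CAAW selection).
        beam = select(candidates)
        # Step 5: agents are stateless in this sketch, so synchronization is
        # implicit: every agent sees the surviving beam on the next iteration.
    return beam[0]
```

In practice the agents and the expander would each wrap an LLM call, and `select` would score candidates on D_dev.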
git clone https://github.com/kawhiiiileo/MAPGD.git
cd MAPGD
To choose different task sets, use the --task argument. For example:
python experiment_baseline.py --task echo
python experiment_baseline.py --task aqua
Detailed settings can be customized in the configuration files:
Config files are located in the root folder for each task, e.g., echo_config.py, aqua_config.py.
You can customize hyperparameters such as beam_size, max_iterations, or CAAW lambda directly in the config.
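
As a rough illustration, a task config might look like the sketch below. Only the names beam_size and max_iterations come from this README. The `CONFIG` dictionary layout, the key `caaw_lambda`, and all values are assumptions, so check the actual file (e.g., echo_config.py) for the real structure and defaults.

```python
# Hypothetical contents of a task config such as echo_config.py.
# The dictionary layout and every value below are illustrative assumptions.
CONFIG = {
    "task": "echo",
    "beam_size": 4,        # number of candidate prompts kept per iteration
    "max_iterations": 6,   # number of optimization rounds
    "caaw_lambda": 0.5,    # assumed key name for the CAAW lambda hyperparameter
}
```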
| Task | Dataset Description | Reference |
|---|---|---|
| ECHO | English online hate speech detection dataset, containing 997 annotated online comments. | Ioannis Mollas, Zoe Chrysopoulou, Stamatis Karlos, and Grigorios Tsoumakas. Ethos: An online hate speech detection dataset. arXiv preprint arXiv:2006.08328, 2020. |
| AQUA | Algebraic word problems for program induction and step-by-step reasoning. | Wang Ling, Dani Yogatama, Chris Dyer, and Phil Blunsom. Program induction by rationale generation: Learning to solve and explain algebraic word problems. arXiv preprint arXiv:1705.04146, 2017. |
| GSM8k | Grade-school math problems requiring multi-step reasoning, widely used benchmark. | Cobbe, Karl et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021. |
| SVAMP | Simple arithmetic word problems with linguistic variations, testing robustness to paraphrasing. | Arkil Patel, Satwik Bhattamishra, and Navin Goyal. Are NLP models really able to solve simple math word problems? arXiv preprint arXiv:2103.07191, 2021. |
| LIAR | Short statements labeled with ground-truth veracity, used for fake news detection. | William Y. Wang. “Liar, Liar Pants on Fire”: A new benchmark dataset for fake news detection. ACL, 2017. |
| Jailbreak | Multilingual prompts targeting jailbreak detection for LLMs, containing 1,306 examples. | Xinyue Shen, Zeyuan Chen, Michael Backes, Yun Shen, and Yang Zhang. “Do Anything Now”: Characterizing and evaluating in-the-wild jailbreak prompts on large language models. Proc. of ACM CCS, 2024, 1671–1685. |
| Ethos | English hate speech detection dataset, used for benchmarking multi-agent prompt optimization. | Vidgen, Bertie et al. Learning to detect harmful online content. arXiv preprint arXiv:2004.08617, 2020. |
| Sarcasm | Arabic sarcasm detection dataset with 10,000 online comments labeled for presence/absence of sarcasm. | Ibrahim Abu Farha and Walid Magdy. From Arabic sentiment analysis to sarcasm detection: The ArSarcasm dataset. The 4th Workshop on Open-Source Arabic Corpora and Processing Tools, ELRA, 2020, 32–39. |
MAPGD supports both text classification and mathematical reasoning tasks.
Multi-agent collaboration is enabled by default, and HCGC+CAAW fusion ensures semantic consistency and adaptive weighting.
The framework allows easy integration of new agents or tasks by extending SpecializedPromptAgent and updating TASK_AGENT_MAPPING.
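
A new agent could be registered roughly as follows. The names SpecializedPromptAgent and TASK_AGENT_MAPPING come from this README, but their real definitions are not shown here, so the sketch declares local stand-ins. The `generate_gradient` method, the `role` attribute, the mapping's value type, and the `ToneAgent` class are all hypothetical.

```python
from typing import Dict, List, Type


class SpecializedPromptAgent:
    """Local stand-in for the repository's base class; the real interface may differ."""

    role = "generic"

    def generate_gradient(self, prompt: str, errors: List[str]) -> str:
        raise NotImplementedError


class ToneAgent(SpecializedPromptAgent):
    """Hypothetical agent that critiques the tone of a prompt."""

    role = "tone"

    def generate_gradient(self, prompt: str, errors: List[str]) -> str:
        # A real agent would call an LLM; this returns a fixed textual gradient.
        return (
            f"The prompt failed on {len(errors)} examples; "
            "state the expected tone explicitly."
        )


# Stand-in for the repository's mapping from task name to agent classes.
TASK_AGENT_MAPPING: Dict[str, List[Type[SpecializedPromptAgent]]] = {}

# Register the new agent for a new (hypothetical) task called "my_task".
TASK_AGENT_MAPPING.setdefault("my_task", []).append(ToneAgent)
```

In the actual repository, you would import the existing base class and mapping rather than redefining them.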
If you find our work useful, please consider citing:
@misc{han2025mapgdmultiagentpromptgradient,
title={MAPGD: Multi-Agent Prompt Gradient Descent for Collaborative Prompt Optimization},
author={Yichen Han and Yuhang Han and Bojun Liu and Zhengpeng Zhou and Guanyu Liu and Zeng Zhang and Yang Yang and Wenli Wang and Isaac N Shi and Yunyan Zhang and Lewei He and Tianyu Shi},
year={2025},
eprint={2509.11361},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2509.11361},
}