CMAT
in complex systems (Lu and Zhang, 2020). To enhance the adaptability and collaborative capabilities of LLMs in multi-agent systems, we've implemented memory modes, including long-term memory support and short-term memory with environmental feedback (Liang et al., 2016). This allows LLMs to better interact, learn, and adapt in dynamic environments, leveraging past experiences and responding to changes swiftly.

2.2 The tuning method for LLMs

The main tuning methods include supervised fine-tuning and reinforcement learning (Ouyang et al., 2022). Supervised fine-tuning enhances performance by training models on specific task datasets and is especially suitable for tasks such as natural language understanding (NLU) (Howard and Ruder, 2018). Reinforcement learning, guided by reward mechanisms, is suitable for handling complex and variable tasks (Mnih et al., 2015). The effective combination of these two methods can significantly improve the performance of LLMs across tasks. Notably, LLMs of reduced scale, such as those encompassing 1.8 billion parameters, can achieve performance levels akin to those of models with greater parameter counts, like 6 billion parameters, when supported by high-quality datasets (Stiennon et al., 2020). This demonstrates that data quality and appropriate tuning strategies play a decisive role in the performance of LLMs. Therefore, investing effort in improving data quality and choosing the right tuning methods is essential for achieving optimal performance of LLMs in various application scenarios (Howard and Ruder, 2018). Through our work combining supervised fine-tuning with reinforcement learning, we have notably advanced LLM performance across a spectrum of tasks, showing significant improvements on task-specific benchmarks (Ouyang et al., 2022).
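To ground the fine-tuning discussion, the sketch below shows a minimal LoRA supervised fine-tuning setup of the kind described above, with the hyperparameters reported in Appendix A (rank 8, alpha 32, learning rate 2e-4, Adam betas (0.9, 0.999), batch size 4 with gradient accumulation). It uses the Hugging Face transformers, peft, and datasets libraries; the base checkpoint and the training file are placeholders rather than our exact artifacts.

```python
# Minimal LoRA supervised fine-tuning sketch (hyperparameters from Appendix A).
# The checkpoint name and the data file are illustrative placeholders.
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

base = "Qwen/Qwen-1_8B"  # placeholder: any causal LM checkpoint
tokenizer = AutoTokenizer.from_pretrained(base, trust_remote_code=True)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # padding needed for batching
model = AutoModelForCausalLM.from_pretrained(base, trust_remote_code=True)

# Rank 8, alpha 32 (Appendix A); adapters on the attention projections.
model = get_peft_model(model, LoraConfig(
    r=8, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM"))

data = load_dataset("json", data_files="agent_trajectories.json")["train"]
data = data.map(lambda ex: tokenizer(ex["text"], truncation=True,
                                     max_length=1024),
                remove_columns=data.column_names)

args = TrainingArguments(
    output_dir="tinyagent-lora",
    learning_rate=2e-4, adam_beta1=0.9, adam_beta2=0.999,  # Appendix A
    per_device_train_batch_size=4, gradient_accumulation_steps=8,
    num_train_epochs=3)

Trainer(model=model, args=args, train_dataset=data,
        data_collator=DataCollatorForLanguageModeling(tokenizer,
                                                      mlm=False)).train()
```

In our pipeline, a reinforcement-style stage driven by environment interaction follows this supervised stage, as described in Section 3.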
Figure 3: Comparative study of Llama-2-7b and TinyAgent-7b in DataBase cases. (1) In DataBase tasks with a
reflection mechanism, Llama-2-7b still made errors after reflection, while TinyAgent-7b adjusted its operations after
reflecting on its first failed attempt. (2) Without a reflection mechanism, TinyAgent-7b repeated the same operation and
ultimately failed to complete the task.
3 Methodology

Our work focuses on the design and implementation of a multi-agent LLM tuning framework, enhancing decision-making quality, controllability, and efficiency in complex systems through collaborative communication and task completion among different agent roles (i.e., User, Assistant, Checker) (Lu et al., 2022). It emphasizes the use of supervised fine-tuning techniques such as LoRA (Hu et al., 2021) and P-Tuning (Lester et al., 2021) on pre-trained datasets such as AgentBench, and further tuning through environment interaction and memory updates, akin to the process of Reinforcement Learning from Human Feedback (RLHF) (Vázquez-Canteli and Nagy, 2019).

3.1 Actor-Evaluator Dynamics in the CMAT Framework

In multi-agent systems, the Actor plays a crucial role in behavior generation through LLMs. The Actor responds to environmental state observations to generate text and actions, while also using reflection to assess and adjust its behavior (Wang and Shi, 2019). In traditional reinforcement learning (RL) frameworks, the Actor decides on actions based on a specific policy and receives information about the current state of the environment. In the CMAT framework, the Actor model utilizes approaches like Chain of Thought and ReAct, extending traditional text generation to explore different facets of behavior generation.

In the CMAT framework, as shown in Algorithm 1, the Checker plays a vital role, specifically tasked with verifying the correctness of the outputs produced by the Actor. It conducts a thorough evaluation process to ensure that the generated outputs meet the standards and requirements of the specific task. By guaranteeing the accuracy and quality of the outputs, this role is critical in overseeing and ensuring the overall performance quality of the system.
Algorithm 1 CMAT Framework
1: Initialize: LLMs, User, Assistant, Checker, task set T, policy π_0, memory mem, Actor θ_actor, Critic θ_critic.
2: for B ∈ T do
3:   Execute Task: Assign Assistant and Checker.
4:   while not complete(B) do
5:     a ← Action from Assistant via LLMs; Execute a for (s′, r).
6:     if Checker verifies a then
7:       Update s′, mem += (s′, r).
8:     else
9:       Adjust π, LLMs; retry B.
10:    end if
11:  end while
12:  Policy Update: Use the Actor-Critic method to refine π with feedback.
13:  Calculate TD error δ_t = r + γ·V(s_{t+1}) − V(s_t).
14:  Update Actor θ_actor using gradient ascent: θ_actor ← θ_actor + α·∇_{θ_actor} log π(a|s)·δ_t.
15:  Update Critic θ_critic using the TD error: θ_critic ← θ_critic + β·δ_t·∇_{θ_critic} V(s).
16:  Update Checker Strategy: Revise the strategy of the Checker to align with the updated policy π and the latest environmental states and predictions.
17: end for
18: Complete: Verify all T.
19: Output: Final states, evaluations.

Long-Term Memory

Long-term memory is especially important when managing the intricate interactions and decisions in multi-agent environments. It enhances LLMs' decision-making by allowing them to draw on past experiences when facing new challenges, thereby improving both their adaptability and response speed. For example, by recalling relevant past interactions, LLMs can develop more precise strategies for new tasks (Penta, 2020).

Short-Term Memory and Environmental Feedback

Short-term memory, in contrast, focuses on the immediate context and the rapid processing of new information (Pae et al., 2012; Liu and Guo, 2019). This is particularly vital in dynamic environments where conditions and requirements can shift quickly (Martin, 1993). Short-term memory enables LLMs to adapt to these changes effectively, ensuring timely and relevant responses to new challenges or tasks. Environmental feedback mechanisms complement these memory modes by providing real-time inputs and interactions within the multi-agent framework (Yogatama et al., 2021). This continuous flow of information helps LLMs to adjust their strategies based on the latest data, facilitating a more fluid and responsive adaptation process. The feedback loop is essential for refining the actions and decisions of LLMs, allowing for incremental learning and optimization based on immediate outcomes and environmental cues (Davelaar et al., 2005).
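To make the control flow concrete, here is a minimal, self-contained sketch of the Algorithm 1 loop. The assistant, checker, and environment are placeholder callables, the actor and critic are simple lookup tables rather than LLM parameters, and the penalty applied to rejected actions is an assumption standing in for "Adjust π" on line 9.

```python
# Minimal sketch of the Algorithm 1 loop. `assistant_act`, `checker_verifies`,
# and the `task` object stand in for LLM calls and the task environment.
import math
from collections import defaultdict

ALPHA, BETA, GAMMA = 0.01, 0.1, 0.99   # actor lr, critic lr, discount

V = defaultdict(float)                 # tabular critic V(s)
prefs = defaultdict(float)            # tabular actor preferences H(s, a)
memory = []                            # mem: verified (s', r) pairs

def pi(s, actions):
    """Softmax policy over the actor's action preferences."""
    weights = [math.exp(prefs[(s, a)]) for a in actions]
    total = sum(weights)
    return [w / total for w in weights]

def run_task(task, assistant_act, checker_verifies, max_retries=3):
    s, retries = task.reset(), 0
    while not task.complete():
        actions = task.available_actions(s)
        probs = pi(s, actions)
        a = assistant_act(s, actions, probs)      # Assistant proposes an action
        if not checker_verifies(s, a):            # Checker rejects the action:
            prefs[(s, a)] -= ALPHA                #   discourage it and retry
            retries += 1
            if retries > max_retries:
                break
            continue
        retries = 0
        s_next, r = task.step(a)                  # execute the verified action
        memory.append((s_next, r))                # mem += (s', r)
        # TD error: delta_t = r + gamma * V(s_{t+1}) - V(s_t)
        delta = r + GAMMA * V[s_next] - V[s]
        # Actor: grad of log pi(a|s) w.r.t. preferences is (1{b=a} - pi(b|s))
        for b, p in zip(actions, probs):
            prefs[(s, b)] += ALPHA * ((1.0 if b == a else 0.0) - p) * delta
        # Critic: V(s) <- V(s) + beta * delta
        V[s] += BETA * delta
        s = s_next
```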
As outlined in Algorithm 1, when the Checker verifies an action, the Assistant updates the state s′ and stores (s′, r) in memory mem; otherwise, it adjusts strategy π and retries. The entire process not only involves confirmation of correct actions but also includes updating the strategy π based on feedback from the Checker, creating a continuous cycle of learning and adjustment. Each iteration aims to optimize the Assistant's decision-making strategy through practice and reflection, enhancing the efficiency and accuracy of task execution. Once all tasks are verified, the framework outputs the final states and evaluation results of all tasks (Silver et al., 2017). Within this cycle, there is also a critical step: updating the Checker's strategy. After each update of the strategy π, not only does the Assistant's strategy need to be adjusted based on feedback, but the Checker's verification strategy must also be updated synchronously so that its decision-making logic remains consistent with the latest environmental state and strategy. Such updates ensure the Checker can accurately assess the Assistant's actions, further enhancing the system's overall performance and reliability. Through this approach, the CMAT framework achieves a deeper level of collaboration and self-optimization between computation and cognition, enabling AI systems to face new challenges and environments with higher efficiency and adaptability, and incorporates both short-term and long-term memory updates for efficient decision-making. Our approach, enhanced with self-reflection and experience replay, offers deeper understanding of and better response to complex scenarios, leading to more precise and rational decisions in changing conditions.

4 Experiments

Our evaluation framework rigorously tests intelligent agents in six key domains to ensure their readiness for diverse real-world challenges (Ross et al., 2023). These areas include seamless LLM integration into operating systems (OS) with an emphasis on security and user interaction; proficiency in real database (DB) operations using SQL (Halevy et al., 2004); task execution on the simulated e-commerce platform WebShop (WS) (Yao et al., 2022); constructing and using knowledge graphs (KGs) for enhanced semantic understanding; employing the Mind2Web (M2W) dataset for complex web tasks, the first dataset for developing general web agents that follow language instructions; and abstract reasoning and visual tasks in the text-based ALFWorld (ALF) (Shridhar et al., 2021). For more implementation and evaluation details, see Appendices A and B.
The test set was tailored to the testing objectives (Sachdeva and McAuley, 2023). Faced with issues of data imbalance and insufficient samples, we utilized data augmentation and knowledge distillation techniques. Knowledge distillation helped us to extract the most valuable and representative information from the vast amount of collected data, thus building an efficient and refined testing dataset. This process significantly improved the quality and applicability of the dataset, providing a solid foundation for evaluating the capabilities of model agents (Mishra and Marr, 2017).

4.2 Evaluating Code Correction

As shown in Table 1, we conducted a comprehensive performance evaluation of TinyAgent-1.8B and the CodeLlama series models (CodeLlama-7B and CodeLlama-13B), aiming to explore their multi-task checking capabilities, including but not limited to code correction, OS configuration, DB query optimization, and WS management. The experimental results showed that TinyAgent-1.8B demonstrated a significant advantage in cross-task performance evaluation compared to the CodeLlama series models. This performance was not only significant in code correction tasks but also prominent in other checking tasks such as OS configuration, DB query optimization, and WS management. These findings highlight that TinyAgent-1.8B not only possesses efficient code analysis capabilities but is also widely applicable to the inspection and optimization of other complex systems.

4.3 Baselines

In the baseline section of our study, we selected Qwen-1.8B and CodeLlama-7B as pivotal benchmarks to assess the TinyAgent series' performance, excluding the CMAT framework's influence.

4.4 Results analysis

The results in Table 2 underscore the effectiveness of our fine-tuning methods, especially for the TinyAgent models. TinyAgent-1.8B demonstrates significant performance in the KG task, on par with advanced models like GPT-3.5. TinyAgent-7B also showcases its strengths, notably in the DB task, where it surpasses its foundational model (Antonello et al., 2020), CodeLlama-7B, and offers competitive scores against GPT-4. These findings indicate the TinyAgent models' capacity to match or even surpass models with larger parameters in certain aspects. Moreover, the CMAT framework's potential to enhance the capabilities of smaller-scale models is highlighted, allowing the TinyAgent models to closely compete with the performance of advanced models such as GPT-4.

As illustrated in Figure 1, our comparative analysis indicates that TinyAgent models, refined from Qwen-1.8B and CodeLlama-7B, exhibit superior performance to their base models. The incorporation of the CMAT framework further amplifies their functionality, equipping these small models to match the capabilities of GPT-3.5. This performance boost is credited to CMAT's optimization of model interactions and its strategic use of memory modes for specific tasks, confirming its effectiveness in enhancing the sophistication of fine-tuned models (Deshpande et al., 2021).

4.5 Error analysis

In our testing framework's error analysis, we observed common challenges in DB tasks faced by models, such as difficulties in understanding user requests, executing actions, and pre-action problem analysis. Many models simply respond with "OK" to specific instructions without performing actual SQL operations, indicating a gap in transforming user requests into database actions. Models often provide superficial acknowledgments without delivering precise execution or in-depth problem analysis, failing to meet user expectations. In contrast, the TinyAgent series excels in understanding and converting user requests into actual SQL operations, effectively comprehending and executing tasks. It provides clear responses and adheres to user-specified SQL formats, fulfilling user expectations comprehensively. Additionally, TinyAgent's thorough pre-action problem analysis and reflection demonstrate its advanced problem-solving skills and deep understanding of issues.

Table 3: Ablation study on the effect of agent and general instructions.

4.6 Ablation Study

Table 3 presents an ablation study on the TinyAgent-7B model, delineating the impact of agent-specific and general instructions on task performance. The composite model, TinyAgent-7B, demonstrates the highest efficacy, notably in WS and DB tasks, which implies its adeptness in handling complex e-commerce interactions and database management. The agent-only variant exhibits a decline in performance, suggesting that while task-specific instructions are crucial, they are not wholly sufficient for the breadth of tasks such as KG. The general-only model's performance is considerably reduced across all tasks, with a complete inability to perform in KG and ALF, highlighting the indispensability of agent-specific instructions. This underscores the necessity of integrating both agent-specific and general instructions to enhance the versatility and effectiveness of AI models in diverse task domains.

5 Conclusions

The main findings of our work reveal that small-parameter models carefully trained on excellent datasets can achieve performance comparable to that of large-parameter models. With the application of the CMAT framework, we further demonstrate significant potential for performance improvement, highlighting the importance of model design and optimization strategies relative to parameter size. In our evaluation, although most open-source LLMs performed poorly compared to API-provided models without optimization, some models displayed capabilities similar to API models after meticulous fine-tuning of the TinyAgent model.
This finding emphasizes not only the importance of parameter size in handling real-world environmental interactions but also showcases the enormous potential of even smaller models through the CMAT framework and precise adjustment strategies.

Limitations

In this study, we demonstrated the potential for performance improvement by applying the CMAT framework to TinyAgent series models and other large language models (LLMs). However, the research has clear limitations. First, although most models showed improved performance, some models saw limited improvement due to weaker base agent capabilities, indicating that the effectiveness of the CMAT framework might vary significantly between models. Second, the limitations of datasets and task types could affect the broad applicability of the conclusions, while low-quality datasets could negatively impact model performance. Lastly, although evaluations based on AgentBench ensured fairness, they might not fully reflect the complexity of real-world scenarios, and due to computational resource constraints, larger-scale models could not be tested. This underscores the importance of future work that considers a wider range of models, datasets, and task types, especially the implementation of optimization strategies and framework applications in resource-constrained situations.

References

Abubakar Abid, Maheen Farooqi, and James Zou. 2021. Large language models associate muslims with violence. Nature Machine Intelligence, 3(6):461–463.

Richard J. Antonello, Javier Turek, and Alexander G. Huth. 2020. Selecting informative contexts improves language model fine-tuning. ArXiv, abs/2005.00175.

Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. 2023. Qwen technical report. arXiv preprint arXiv:2309.16609.

Yalong Bai, Kuiyuan Yang, Wei Yu, Chang Xu, Wei-Ying Ma, and T. Zhao. 2015. Automatic image dataset construction from click-through logs using deep neural network. In Proceedings of the 23rd ACM International Conference on Multimedia.

Sushovan Bhadra. 2022. A stochastic petri net model of continuous integration and continuous delivery. In 2022 IEEE International Symposium on Software Reliability Engineering Workshops (ISSREW), pages 114–117.

Chunjiang Che, Xiaoli Li, Chuan Chen, Xiaoyu He, and Zibin Zheng. 2021. A decentralized federated learning framework via committee mechanism with convergence guarantee. IEEE Transactions on Parallel and Distributed Systems, 33:4783–4800.

C. L. P. Chen and Zhulin Liu. 2018. Broad learning system: An effective and efficient incremental learning system without the need for deep architecture. IEEE Transactions on Neural Networks and Learning Systems, 29:10–24.

Filippos Christianos, Georgios Papoudakis, Matthieu Zimmer, Thomas Coste, Zhihao Wu, Jingxuan Chen, Khyati Khandelwal, James Doran, Xidong Feng, Jiacheng Liu, et al. 2023. Pangu-agent: A fine-tunable generalist agent with structured reasoning. arXiv preprint arXiv:2312.14878.

E. Davelaar, Y. Goshen-Gottstein, Amir Ashkenazi, H. Haarmann, and M. Usher. 2005. The demise of short-term memory revisited: empirical and computational investigations of recency effects. Psychological Review, 112(1):3–42.

I. de Zarzà, J. de Curtò, Gemma Roig, Pietro Manzoni, and Carlos T. Calafate. 2023. Emergent cooperation and strategy adaptation in multi-agent systems: An extended coevolutionary theory with llms. Electronics, 12(12):2722.

A. Deshpande, A. Achille, Avinash Ravichandran, Hao Li, L. Zancato, Charless C. Fowlkes, Rahul Bhotika, Stefano Soatto, and P. Perona. 2021. A linearized framework and a new benchmark for model selection for fine-tuning. ArXiv, abs/2102.00084.

Elhadji Amadou Oury Diallo, Ayumi Sugiyama, and T. Sugawara. 2020. Coordinated behavior of cooperative agents using deep reinforcement learning. Neurocomputing, 396:230–240.

Mengnan Du, Fengxiang He, Na Zou, Dacheng Tao, and Xia Hu. 2022. Shortcut learning of large language models in natural language understanding: A survey. arXiv preprint arXiv:2208.11857.

Alexander Dunn, John Dagdelen, Nicholas Walker, Sanghoon Lee, Andrew S. Rosen, Gerbrand Ceder, Kristin Persson, and Anubhav Jain. 2022. Structured information extraction from complex scientific text with fine-tuned large language models. arXiv preprint arXiv:2212.05238.

N. Ferry, F. Chauvel, Hui Song, A. Rossini, Maksym Lushpenko, and Arnor Solberg. 2018. Cloudmf: Model-driven management of multi-cloud applications. ACM Transactions on Internet Technology, 18:16:1–16:24.
Eloy García, Yongcan Cao, and D. Casbeer. 2015. Periodic event-triggered synchronization of linear multi-agent systems with communication delays. IEEE Transactions on Automatic Control, 62:366–371.

Jianping Gou, B. Yu, S. Maybank, and D. Tao. 2020. Knowledge distillation: A survey. International Journal of Computer Vision, 129:1789–1819.

A. Halevy, Z. Ives, J. Madhavan, P. Mork, Dan Suciu, and I. Tatarinov. 2004. The piazza peer data management system. IEEE Transactions on Knowledge and Data Engineering, 16:787–798.

José Hernández-Orallo, Marco Baroni, Jordi Bieger, Nader Chmait, David L. Dowe, Katja Hofmann, Fernando Martínez-Plumed, Claes Strannegård, and Kristinn R. Thórisson. 2017. A new ai evaluation cosmos: Ready to play the game? AI Magazine, 38(3):66–69.

Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, pages 328–339.

Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2021. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685.

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 2020. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361.

Daiju Kato, Ayumu Shimizu, and Hiroshi Ishikawa. 2022. Quality classification for testing work in devops. In Proceedings of the 14th International Conference on Management of Digital EcoSystems, pages 156–162.

Brian Lester, Rami Al-Rfou, and Noah Constant. 2021. The power of scale for parameter-efficient prompt tuning. arXiv preprint arXiv:2104.08691.

Mike Lewis, Denis Yarats, Yann N. Dauphin, Devi Parikh, and Dhruv Batra. 2017. Deal or no deal? end-to-end learning for negotiation dialogues. arXiv preprint arXiv:1706.05125.

Guanbin Li, Yukang Gan, Hejun Wu, Nong Xiao, and Liang Lin. 2018. Cross-modal attentional context learning for rgb-d object detection. IEEE Transactions on Image Processing, 28:1591–1601.

Guohao Li, Hasan Abed Al Kader Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. 2023. Camel: Communicative agents for "mind" exploration of large scale language model society. arXiv preprint arXiv:2303.17760.

Xin Liang, G. Shen, and Shanshan Bu. 2016. Multiagent systems in construction: A ten-year review. Journal of Computing in Civil Engineering, 30:04016016.

Gang Liu and Jiabao Guo. 2019. Bidirectional lstm with attention mechanism and convolutional layer for text classification. Neurocomputing, 337:325–338.

Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, et al. 2023. Agentbench: Evaluating llms as agents. arXiv preprint arXiv:2308.03688.

Kai Lu and Chongyang Zhang. 2020. Blockchain-based multiparty computation system. In 2020 IEEE 11th International Conference on Software Engineering and Service Science (ICSESS), pages 28–31. IEEE.

Zehui Lu, Wanxin Jin, S. Mou, and B. Anderson. 2022. Cooperative tuning of multi-agent optimal control systems. In 2022 IEEE 61st Conference on Decision and Control (CDC), pages 571–576.

R. Martin. 1993. Short-term memory and sentence processing: Evidence from neuropsychology. Memory and Cognition, 21:176–183.

Asit K. Mishra and Debbie Marr. 2017. Apprentice: Using knowledge distillation techniques to improve low-precision network accuracy. ArXiv, abs/1711.05852.

Volodymyr Mnih, K. Kavukcuoglu, David Silver, Andrei A. Rusu, J. Veness, Marc G. Bellemare, A. Graves, Martin A. Riedmiller, A. Fidjeland, Georg Ostrovski, Stig Petersen, Charlie Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, D. Kumaran, Daan Wierstra, S. Legg, and D. Hassabis. 2015. Human-level control through deep reinforcement learning. Nature, 518:529–533.

OpenAI. 2023. Gpt-4 technical report.

Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, J. Schulman, Jacob Hilton, Fraser Kelton, Luke E. Miller, Maddie Simens, Amanda Askell, P. Welinder, P. Christiano, J. Leike, and Ryan J. Lowe. 2022. Training language models to follow instructions with human feedback. ArXiv, abs/2203.02155.

Hye K. Pae et al. 2012. Linguistic relativity revisited: The interaction between l1 and l2 in thinking, learning, and production. Psychology, 3(01):49.

M. Di Penta. 2020. Understanding and improving continuous integration and delivery practice using data from the wild. In Proceedings of the 13th Innovations in Software Engineering Conference (formerly known as India Software Engineering Conference).

Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know what you don't know: Unanswerable questions for squad. arXiv preprint arXiv:1806.03822.

Steven I. Ross, Fernando Martinez, Stephanie Houde, Michael J. Muller, and Justin D. Weisz. 2023. The programmer's assistant: Conversational interaction with a large language model for software development. ArXiv, abs/2302.07080.

Baptiste Roziere, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, et al. 2023. Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950.

Noveen Sachdeva and Julian McAuley. 2023. Data distillation: A survey. ArXiv, abs/2301.04272.

Mohit Shridhar, Xingdi Yuan, Marc-Alexandre Côté, Yonatan Bisk, Adam Trischler, and Matthew Hausknecht. 2021. Alfworld: Aligning text and embodied environments for interactive learning. In Proceedings of the International Conference on Learning Representations (ICLR).

David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. 2017. Mastering the game of go without human knowledge. Nature, 550(7676):354–359.

Nisan Stiennon, Long Ouyang, Jeff Wu, Daniel M. Ziegler, Ryan J. Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul Christiano. 2020. Learning to summarize from human feedback. ArXiv, abs/2009.01325.

Yashar Talebirad and Amirhossein Nadiri. 2023. Multi-agent collaboration: Harnessing the power of intelligent llm agents. arXiv preprint arXiv:2306.03314.

V. Talwar, Qinyi Wu, C. Pu, W. Yan, G. Jung, and D. Milojicic. 2005. Comparison of approaches to service deployment. In 25th IEEE International Conference on Distributed Computing Systems (ICDCS'05), pages 543–552.

Meiling Tao, Xuechen Liang, Tianyu Shi, Lei Yu, and Yiting Xie. 2023. Rolecraft-glm: Advancing personalized role-playing in large language models. arXiv preprint arXiv:2401.09432.

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.

Lewis Tunstall, Edward Beeching, Nathan Lambert, Nazneen Rajani, Kashif Rasul, Younes Belkada, Shengyi Huang, Leandro von Werra, Clémentine Fourrier, Nathan Habib, et al. 2023. Zephyr: Direct distillation of lm alignment. arXiv preprint arXiv:2310.16944.

José R. Vázquez-Canteli and Zoltán Nagy. 2019. Reinforcement learning for demand response: A review of algorithms and modeling techniques. Applied Energy, 235:1072–1089.

Guihong Wang and Jinglun Shi. 2019. Actor-critic for multi-agent system with variable quantity of agents. In IoT as a Service: 4th EAI International Conference, IoTaaS 2018, Xi'an, China, November 17–18, 2018, Proceedings, pages 48–56. Springer.

Jun Wang, Yong-Hong Sun, Z. Fan, and Yan Liu. 2005. A collaborative e-learning system based on multi-agent. Pages 455–463.

Aiyuan Yang, Bin Xiao, Bingning Wang, Borong Zhang, Ce Bian, Chao Yin, Chenxu Lv, Da Pan, Dian Wang, Dong Yan, et al. 2023. Baichuan 2: Open large-scale language models. arXiv preprint arXiv:2309.10305.

Shunyu Yao, Howard Chen, John Yang, and Karthik Narasimhan. 2022. Webshop: Towards scalable real-world web interaction with grounded language agents.

Dani Yogatama, Cyprien de Masson d'Autume, and Lingpeng Kong. 2021. Adaptive semiparametric language models. Transactions of the Association for Computational Linguistics, 9:362–373.

Aohan Zeng, Mingdao Liu, Rui Lu, Bowen Wang, Xiao Liu, Yuxiao Dong, and Jie Tang. 2023. Agenttuning: Enabling generalized agent abilities for llms. arXiv preprint arXiv:2310.12823.

Peiyuan Zhang, Guangtao Zeng, Tianduo Wang, and Wei Lu. 2024. Tinyllama: An open-source small language model. arXiv preprint arXiv:2401.02385.

Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. 2022. Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068.

Wentao Zhang, Yang Liu, Jianquan Lu, and Jinde Cao. 2017. A novel consensus algorithm for second-order multi-agent systems without velocity measurements. International Journal of Robust and Nonlinear Control, 27:2510–2528.

Kai Zhao, Yongduan Song, C. L. Philip Chen, and Long Chen. 2021. Adaptive asymptotic tracking with global performance for nonlinear systems with unknown control directions. IEEE Transactions on Automatic Control, 67(3):1566–1573.
A Implementation Settings

In this paper, we describe an experiment conducted using the Low-Rank Adaptation (LoRA) fine-tuning method to enhance the performance of various models (Wang et al., 2005). The accuracy of the LoRA method is of paramount importance in dealing with personalized and emotionally rich content. It enables the models to adapt to new data features while maintaining their core capabilities (Bai et al., 2015).

During the experiment, we set the temperature parameter of the test models to 0.7 to increase the diversity of the content, and adjusted the top-p value to 0.95 to improve the precision of the generated content. We employed a learning rate of 2e-4 and beta values of (0.9, 0.999) to ensure the stability of the training process. The batch size was set to 4, with gradient accumulation, to ensure efficiency within the limits of computational resources. To balance innovation and coherence, we used LoRA parameters with a rank of 8 and an alpha value of 32, and adjusted both the top-p value and the temperature parameter to 0.7. These adjustments significantly enhanced the models' flexibility and accuracy in handling personalized and emotionally rich content.

B Evaluation criteria

(1) Operating Systems (OS). Integrating LLMs into operating systems offers vast potential for automating and optimizing tasks. This integration demands a secure, user-friendly interface for effective LLM-OS interaction and requires LLMs to accurately understand the OS context for informed operations. Ensuring the safety of these operations is paramount to prevent misuse. Moreover, the system must handle errors and provide clear feedback to users, enhancing interaction and control. Addressing these aspects can revolutionize computer interaction and efficiency across industries.

(2) Database (DB). Due to the crucial and challenging nature of database analysis in many daily affairs, it is paramount to examine the abilities of LLMs to operate on real databases via SQL. Previous research has placed significant emphasis on individual procedures, such as showcasing the effectiveness of LLMs in automating database access through T5QL, a new SQL generation method. Additionally, utilizing fine-tuned LLMs (such as GPT-3.5) to extract and link complex scientific information from scientific texts has demonstrated the capacity of LLMs to obtain structured knowledge from unstructured text and subsequently construct large databases (Dunn et al., 2022).

(3) WebShop (WS) represents an innovative simulation of an e-commerce website environment, featuring 1.18 million real-world products and 12,087 crowd-sourced text instructions. This platform challenges agents to navigate through multiple types of webpages and perform a variety of actions to find, customize, and purchase products according to given instructions. WebShop presents several challenges, including understanding compositional instructions, query (re)formulation, dealing with noisy text in webpages, and conducting strategic exploration.

(4) Knowledge Graphs (KG). The utilization of LLMs in constructing and interacting with knowledge graphs offers a promising avenue for enhancing semantic understanding and information retrieval. This involves assessing the models' ability to not only generate but also interpret complex interrelations within data, facilitating more intuitive and context-aware responses. The effectiveness of LLMs in this domain could significantly improve AI's capacity for reasoning and decision-making based on structured knowledge.

(5) Mind2Web (M2W) is a dataset for developing web agents that perform complex tasks on real websites via language instructions. It features over 2,000 tasks across 137 sites from 31 domains. M2W's real web environments and diverse user interactions make it a crucial platform for advancing AI navigation capabilities.
Table 4: Distribution of various execution results across six tasks (CLE: Exceeded Context Limit, TLE: Surpassed Task Limit). Task limits exceeded are the main reason for incomplete tasks, pointing to limitations in LLM agents' reasoning and decision-making within constrained timeframes.

                  OS    DB    KG   ALF    WS   M2W
Completed       84.7  84.0  25.0   2.0  93.5  57.0
CLE              0.0   0.0   0.0   0.0   0.0   0.0
Invalid Format   0.0   3.0   0.0   0.0   0.0   0.0
Invalid Action   0.0   0.0   0.0  96.0   0.0   8.0
TLE             15.3  13.0  75.0   2.0   6.5  35.0
(6) ALFWorld (ALF) bridges interactive TextWorld environments with embodied tasks from the ALFRED dataset, enabling agents to learn abstract strategies and apply them to real-world tasks. It facilitates abstract reasoning and concrete execution, allowing agents to plan actions in a text-based simulator and then execute these tasks in a visual environment. This approach enhances agent generalization and problem-solving skills across various domains, such as language understanding and visual navigation, by leveraging a modular design that simplifies research improvements.

C Details about Experiment

In our research, we focused on exploring the impact of meticulously crafted prompts on the performance of Natural Language Processing (NLP) models in code correction tasks, as shown in Table 5. By examining prompts of varying quality (high-quality, low-quality, and no prompts), we aimed to reveal how these factors influence the model's capability, especially in identifying and correcting programming errors. Our findings indicate that fine-tuned high-quality prompts significantly enhance the model's ability to write and rigorously check code, enabling it to produce high-quality code and conduct stricter code inspections, as demonstrated in Tables 23∼25. In contrast, low-quality prompts might lead the model to generate nonsensical and ineffective code, and in situations where correct code is provided, the model might even alter it to incorrect code. The decline in model output quality in these instances is primarily due to the vague and unclear descriptions provided by the prompts and the lack of specific guidance, which prevents the model from effectively utilizing the knowledge it acquired during training to focus on key error points and make accurate correction decisions. Moreover, our study showed that in the absence of any prompt guidance, the model's performance tends to be limited by the quality and scope of its training data, particularly when faced with novel or complex error types. This finding underscores the importance of high-quality prompts in designing effective NLP models for complex tasks like code correction. High-quality prompts not only guide the model to more accurately understand and address specific programming issues but also reduce noise in the processing, thereby improving overall efficiency and accuracy.

D Prompts for Tasks

We describe the task to the checker agent using the instructions shown in Tables 6∼12.

E Examples of Tasks

All tasks in the dataset are divided into six categories. For each category, we present an example of successful execution, as shown in Tables 13∼22.
Table 5: Evaluation Metrics Results

DATABASE(DB)

User

As the Database Inspector, your role involves overseeing and validating the assistant's interactions with a MySQL database. The objective is a seamless collaboration to fulfill a specified task. The process is straightforward: the assistant will submit SQL queries, and it's your responsibility to examine the SQL syntax and logic to confirm their accuracy in addressing the given question.

The mission is to collaborate effectively. You leverage your expertise, guiding the assistant through the task. If the assistant deviates from the correct approach, you're expected to provide guidance, offering both your insight and the correct solution. Documenting your thought process is essential.

During each review cycle, you're given the option to act by providing a SQL command or conclude with an answer. Your actions should adhere to this format:

Action: Execute SQL
Example: SELECT * FROM table WHERE condition;

Ensure the SQL is concise, presented in markdown format, restricted to a single line without additional commentary. Only one SQL statement should be executed at a time. The assistant will execute only the first SQL code block you provide and then share the output. Once you've completed an action and are ready to present a final answer, your submission should be as follows:

Action: Submit Final Answer
Final Answer: ["Answer 1", "Answer 2", ...]

Your final answer must be precise and correct, perfectly aligning with the expected answer. If the task involves modifying the database, the answer field might vary post-operation. Nonetheless, any departure from the specified response format will lead to an immediate failure of the task. It's important to remember that you will be presented with the raw MySQL response to analyze independently. Accuracy and correctness are crucial in this joint effort.
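The checker protocol above is strict enough to be machine-parsed. The following sketch is one hypothetical way a harness might extract either the SQL statement or the final answer from a checker reply; the function name and regular expressions are illustrative, not part of the CMAT implementation.

```python
# Illustrative parser for the DB checker protocol described above.
import ast
import re

SQL_RE = re.compile(r"Action:\s*Execute SQL\s*(?:```sql)?\s*(.+?)\s*(?:```)?\s*$",
                    re.DOTALL | re.IGNORECASE)
ANSWER_RE = re.compile(r"Action:\s*Submit Final Answer\s*Final Answer:\s*(\[.*\])",
                       re.DOTALL | re.IGNORECASE)

def parse_checker_reply(reply: str):
    """Return ("sql", statement), ("answer", list), or ("invalid", None)."""
    m = ANSWER_RE.search(reply)
    if m:
        try:
            return "answer", ast.literal_eval(m.group(1))  # e.g. ["Answer 1"]
        except (ValueError, SyntaxError):
            return "invalid", None
    m = SQL_RE.search(reply)
    if m:
        sql = m.group(1).strip().splitlines()[0]  # only the first statement runs
        return "sql", sql
    return "invalid", None  # the protocol treats anything else as a failure

print(parse_checker_reply("Action: Execute SQL\nSELECT * FROM t WHERE x = 1;"))
```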
Table 7: Prompt for KNOWLEDGE GRAPH

KNOWLEDGE GRAPH(KG)

User

As a reviewer, your task is to verify that the system for answering questions based on a knowledge base (KB) operates correctly. To achieve this goal, you will need to use the following tools to review the query process in the knowledge base:
WEBSHOP(WS)

User

As the Shopping Experience Auditor, you are charged with the task of verifying that all actions undertaken in our web shopping simulation adhere to the given instructions and are executed correctly. Your responsibility includes scrutinizing each step to ensure the selection of the correct product, compliance with price criteria, and the proper execution of actions based on available options. Should any discrepancies arise, it's within your purview to identify them and recommend appropriate corrections. You are invited to specify any particular interactions for verification, and you will conduct a thorough assessment to guarantee the precision of our shopping procedure.
ALFWORLD(ALF)

User

As an Interactive Environment Auditor, your task is to meticulously review the actions taken by the intelligent agent in the household environment, ensuring they comply with the given instructions and the range of available actions. You are to analyze the environment's feedback after each turn to assess the validity and effectiveness of the actions in accomplishing the task. Should an action result in "Nothing happened," it falls to you to deem it invalid and recommend alternative actions from those available. Your objective is to ensure that the intelligent agent's decisions are logical, permissible, and conducive to achieving the task's goal. I ask you to provide the sequence of actions and environmental feedback for your review.
Table 11: Prompt for MIND2WEB

MIND2WEB(M2W)

User

Task Description:
[Insert specific task description here]

Please examine the HTML element and the task description. Determine if the agent's proposed action accurately fulfills the task requirements. Provide feedback on the appropriateness of the action. If it deviates from what is required, kindly specify the correct action that should be taken.
Table 12: Prompt for OPERATING SYSTEM

OPERATING SYSTEM(OS)

User

As the System Operation Audit Expert, your role is to evaluate the assistant's actions in the context of a simulated interaction with a Linux (Ubuntu) operating system. The assistant will perform tasks based on your requests and provide explanations for their actions. For each round, review the assistant's thought process and the action taken, and determine if the task has been correctly completed.

The assistant can take one of three actions:
1. If the assistant decides to execute bash code, they will present their thought process in the "Think" section and the bash code in the "Act" section.
2. If the assistant believes the task is finished, they will indicate completion in the "Think" section and provide a justification for why the task is considered complete.
3. If the assistant has an answer to a question, they will explain how they arrived at the answer in the "Think" section and provide the answer in the "Act" section.

Your task is to assess the assistant's performance based on the following criteria:
Accuracy: Is the action taken appropriate for the task described?
Completeness: Has the assistant provided all necessary information and performed all required steps?
Justification: Is the thought process logical and does it support the action taken?

Please provide feedback on the assistant's actions, including any errors or omissions, and suggest corrections if necessary. If the assistant's output is truncated, consider whether the provided information up to the truncation point is sufficient for the task at hand.
Table 13: Example for OPERATING SYSTEM

OPERATING SYSTEM(OS)

<BASH>
ls /etc

User: The output of the OS: cpi cron.hourly fuse.conf iproute2 lvm networkd-dispatcher protocols selinux tmpfiles.d [truncated because the output is too long]

Agent: Think: The output has been truncated because it is too long, so I need to count files by script directly.
Act: bash
<BASH>
ls -1 /etc | wc -l

Agent: Think: To find the total number of lines containing the word "Linux" in all the ".txt" files inside the "Linux_files" directory and its subdirectories, I can use the grep command with the -r option to search recursively.
Act: bash
<BASH>
grep -r "Linux" ~/Linux_files/*.txt | wc -l

Agent: Think: The output shows that there are 4 lines containing the word "Linux" in all the ".txt" files inside the "Linux_files" directory and its subdirectories.
Act: answer(4)
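The Think/Act protocol in this example can be driven programmatically. Below is a hypothetical harness fragment showing how an evaluation script might extract the bash code from a <BASH> block, run it, and truncate long output the way the environment does above; none of these helper names come from the paper.

```python
# Hypothetical harness for the OS task's Think/Act protocol.
import re
import subprocess

BASH_BLOCK = re.compile(r"<BASH>\s*\n(.*?)(?:\n\s*(?:Agent:|User:|$))", re.DOTALL)

def run_act(agent_reply: str, max_chars: int = 800) -> str:
    """Execute the first <BASH> block in the reply and return its output,
    truncated like the environment feedback shown above."""
    m = BASH_BLOCK.search(agent_reply)
    if m is None:
        return "no bash block found"
    result = subprocess.run(["bash", "-c", m.group(1).strip()],
                            capture_output=True, text=True, timeout=30)
    out = result.stdout + result.stderr
    if len(out) > max_chars:
        out = out[:max_chars] + " [truncated because the output is too long]"
    return out

print(run_act("Think: count entries.\nAct: bash\n<BASH>\nls -1 /etc | wc -l\n"))
```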
Table 14: Example for KNOWLEDGE GRAPH

KNOWLEDGE GRAPH(KG)

Agent: Thought: Now I need to find the city where Busch Stadium is located. To achieve this, I will query the KB to find relations connected to the variable #1.
Action: get_relations(#1)

Agent: Thought: Now I need to find the city where Busch Stadium is located. To achieve this, I will query the KB to find relations connected to the variable #2.
Action: get_relations(#2)
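The get_relations calls in this trace are tool invocations against the knowledge base. As a rough illustration of the interface the example implies, the sketch below mocks a KB where get_relations returns the relation names attached to an entity or intermediate variable; the toy data and the function body are invented for illustration.

```python
# Mock of the get_relations tool implied by the KG example above.
# The toy knowledge base, relation names, and variable IDs are invented.
TOY_KB = {
    "Busch Stadium": {"location.located_in": "#1"},
    "#1": {"location.containedby": "St. Louis", "location.city.state": "#2"},
    "#2": {"location.state.capital": "Jefferson City"},
}

def get_relations(variable: str) -> list[str]:
    """Return the outgoing relation names connected to a variable."""
    return sorted(TOY_KB.get(variable, {}).keys())

print(get_relations("#1"))  # ['location.city.state', 'location.containedby']
```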
ALFWORLD(ALF)

Agent Thought: "The task is to examine the mug with the desklamp. I need to find where the mug is first and then find the desklamp to light it up. I'll start my search on desk 1."
Agent Action: "examine desk 1"
Environment Feedback: "On the desk 1, you see a bowl 1, a desklamp 1, a laptop 1, and a pen 1."
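The ALFWORLD prompt's instruction to treat the feedback "Nothing happened" as marking an invalid action lends itself to a simple check. The helper below is an illustrative sketch of that rule, not code from the paper: it flags invalid actions and falls back to the remaining admissible ones.

```python
# Illustrative check for the ALFWorld rule described in the prompt above:
# feedback "Nothing happened" marks the last action as invalid.
def review_turn(action: str, feedback: str, available: list[str]) -> dict:
    valid = feedback.strip().rstrip(".").lower() != "nothing happened"
    return {
        "action": action,
        "valid": valid,
        # Recommend alternatives only when the action failed.
        "alternatives": [] if valid else [a for a in available if a != action],
    }

turn = review_turn("examine desk 1",
                   "On the desk 1, you see a bowl 1, a desklamp 1, ...",
                   ["examine desk 1", "go to desk 2", "use desklamp 1"])
print(turn["valid"])  # True
```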
MIND2WEB(M2W)

User: Based on the HTML webpage above, try to complete the following task:
Task: Find tuition & fees for full term graduate students in the college of engineering for the academic year 2022-2023
Previous actions:
[link] Costs & Financial Aid -> CLICK
[combobox] Academic Year -> SELECT: 2022-2023
[combobox] College/School -> SELECT: College of Engineering
[combobox] Full or Half Term -> SELECT: Full Term
[combobox] Level of Study -> SELECT: Graduate
What should be the next action? Please select from the following choices (If the correct action is not in the page above, please select A. 'None of the above'):
A. None of the above
B. <a id=0> Graduation </a>
C. <select id=1 academic_year> <option all> Select </option> <option 175 true>
D. <select id=2 college_school> <option all> Select </option> <option 17> College
E. <select id=3 level_of_study> <option all> Select </option> <option 38 true>
F. <input id=4 submit apply />

Agent: Thought: The search has already been set to College of Engineering and Graduate level. Next, I should select Full Term.
Answer: E.
Action: SELECT

User: Based on the HTML webpage above, try to complete the following task:
Task: Find tuition & fees for full term graduate students in the college of engineering for the academic year 2022-2023
Previous actions:
[link] Costs & Financial Aid -> CLICK
[combobox] Academic Year -> SELECT: 2022-2023
[combobox] College/School -> SELECT: College of Engineering
[combobox] Full or Half Term -> SELECT: Full Term
[combobox] Level of Study -> SELECT: Graduate
What should be the next action? Please select from the following choices (If the correct action is not in the page above, please select A. 'None of the above'):
A. None of the above
B. <input id=0 text search />
C. <a id=1> Tuition </a>
D. <div id=2> Tuition & Fees </div>
E. <span id=3> Tuition & Fees </span>
F. <a id=4> Class Fees </a>
Table 18: Example for MIND2WEB

MIND2WEB(M2W)

Agent: Thought: The search has already been set to Tuition. Next, I should find the tuition & fees for full term graduate students in the college of engineering for the academic year 2022-2023.
Answer: D.
Action: CLICK on the link with the text "Tuition & Fees"

User: Based on the HTML webpage above, try to complete the following task:
Task: Find tuition & fees for full term graduate students in the college of engineering for the academic year 2022-2023
Previous actions:
[link] Costs & Financial Aid -> CLICK
[combobox] Academic Year -> SELECT: 2022-2023
[combobox] College/School -> SELECT: College of Engineering
[combobox] Full or Half Term -> SELECT: Full Term
[combobox] Level of Study -> SELECT: Graduate
What should be the next action? Please select from the following choices (If the correct action is not in the page above, please select A. 'None of the above'):
A. None of the above
B. <div id=0> Tuition & Fees </div>
C. <select id=1 level_of_study> <option all> Select </option> <option 38 true>

Agent: Thought: The search has already been set to graduate students in the college of engineering for the academic year 2022-2023. Next, I should select the level of study.
Answer: C.
Action: SELECT: Graduate
Table 19: Example for WEBSHOP

WEBSHOP(WS)

User: You are web shopping. I will give you instructions about what to do. You have to follow the instructions. Every round I will give you an observation and a list of available actions; you have to respond with an action based on the state and instruction. You can use the search action if the search is available. You can click one of the buttons in clickables. An action should be of the following structure: search[keywords] or click[value]. If the action is not valid, perform nothing. Keywords in search are up to you, but the value in click must be a value in the list of available actions. Remember that your keywords in search should be carefully designed. Your response should use the following format:
Thought:
I think ...
Action:
click[something]

Agent: Ok.

User: Observation:
"WebShop [SEP] Instruction: [SEP] I'm looking for rubber stole shoes for light wearing it was brown in color, and price lower than 70.00 dollars [SEP] Search"
Available Actions:
{"has_search_bar": true, "clickables": ["search"]}

Agent: Thought:
I think I should use the search bar to look for the product I need.
Action:
search[rubber stole shoes brown price < 70.00]

User: Observation:
Instruction: [SEP] I'm looking for rubber stole shoes for light wearing it was brown in color, and price lower than 70.00 dollars [SEP] Back to Search [SEP] Page 1 (Total results: 50) [SEP] Next > [SEP] B08XHSG67V [SEP] Crosley Furniture Madeleine 53.25" Farmhouse Metal Kitchen Island in Brown/Black [SEP] 331.49 [SEP] B07CZ5YKWW [SEP] Roper Tumbled Womens Brown Leather Petty Slip-On Shoes [...]
Table 20: Example for WEBSHOP

WEBSHOP(WS)

[...]
Available Actions:
{"has_search_bar": false, "clickables": ["back to search", "next >", "b08xhsg67v", "b07cz5ykww", "b000jiatqs", "b01ld5a59g", "b085wr7626", "b07s7qxncw", "b01lykp60o", "b004xj62re", "b07fy9mfhx", "b09cfqjt7h"]}

Agent: Thought:
I think I should click on the product I need, which is B08XHSG67V.
Action:
click[B08XHSG67V]

User: Observation:
Instruction: [SEP] I'm looking for rubber stole shoes for light wearing it was brown in color, and price lower than 70.00 dollars [SEP] Back to Search [SEP] < Prev [SEP] Crosley Furniture Madeleine 53.25" Farmhouse Metal Kitchen Island in Brown/Black [SEP] Price: $331.49 [SEP] Rating: N.A. [SEP] Description [SEP] Features [SEP] Reviews [SEP] Buy Now
Available Actions:
{"has_search_bar": false, "clickables": ["back to search", "< prev", "description", "features", "reviews", "buy now"]}

Agent: Thought:
I think I should click on the 'Buy Now' button to purchase the product.
Action:
click[Buy Now]
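The action grammar in this dialogue (search[keywords] or click[value], with click values restricted to the advertised clickables) is simple to validate. The sketch below is an illustrative validator under that stated grammar; it is not taken from the WebShop codebase.

```python
# Illustrative validator for the WebShop action grammar described above.
import re

ACTION_RE = re.compile(r"^(search|click)\[(.+)\]$")

def validate_action(action: str, has_search_bar: bool,
                    clickables: list[str]) -> bool:
    """Accept search[...] when a search bar exists, or click[v] when v is
    one of the advertised clickables (matched case-insensitively)."""
    m = ACTION_RE.match(action.strip())
    if m is None:
        return False
    verb, arg = m.group(1), m.group(2).strip()
    if verb == "search":
        return has_search_bar and bool(arg)
    return arg.lower() in {c.lower() for c in clickables}

print(validate_action("click[Buy Now]", False,
                      ["back to search", "description", "buy now"]))  # True
```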
Table 21: Example for DATABASE

DATABASE(DB)

Action: Operation
```sql
SELECT * FROM table WHERE condition;
```
You MUST put SQL in markdown format without any other comments. Your SQL should be in one line. Every time you can only execute one SQL statement. I will only execute the statement in the first SQL code block. Every time you write a SQL, I will execute it for you and give you the output. If you are done operating, and you want to commit your final answer, then write down:
Action: Answer
Final Answer: ["ANSWER1", "ANSWER2", ...]
DO NOT write this pattern unless you are sure about your answer. I expect an accurate and correct answer. Please note: Your answer should be accurate. Your answer must be exactly the same as the correct answer. If the question is about modifying the database, then after done operation, your answer field can be anything. If your response cannot match any pattern I mentioned earlier, you will be judged as FAIL immediately. Your input will be raw MySQL response, you have to deal with it by yourself.
Table 22: Example for DATABASE

DATABASE(DB)

Agent: Ok.

User: What event did he fight masato shiozawa?\nThe name of this table is MMA Fight Record, and the headers of this table are Res, Record, Opponent, Method, Event, Round.

Agent: To find the event that Masato Shiozawa and I fought, I can use a SELECT statement to retrieve the `Event` column from the `MMA Fight Record` table. To specify the condition that Masato Shiozawa was the opponent, I can use a WHERE clause with the condition `Opponent = 'Masato Shiozawa'`.
Action: Operation
```sql
SELECT `Event` FROM `MMA Fight Record` WHERE `Opponent` = 'Masato Shiozawa'
```

Agent: The SQL query executed successfully and returned the expected answer. The event that Masato Shiozawa and I fought was 'Shooto 2003 - 5/4 in Korakuen Hall'. I can now commit my answer.
Action: Answer
Final Answer: ["Shooto 2003 - 5/4 in Korakuen Hall"]
Table 23: prompt - High-quality