
CMAT: A Multi-Agent Collaboration Tuning Framework for Enhancing Small Language Models

Xuechen Liang¹∗, Meiling Tao²∗, Tianyu Shi³†, Yiting Xie⁴
¹East China Jiaotong University, Nanchang, China
²Guangdong University of Technology, Guangzhou, China
³University of Toronto, Toronto, Canada
⁴Genfun.ai, Beijing, China

Abstract

Open large language models (LLMs) have significantly advanced the field of natural language processing, showcasing impressive performance across various tasks. Despite the significant advancements in LLMs, their effective operation still relies heavily on human input to accurately guide the dialogue flow, with agent tuning being a crucial optimization technique that involves human adjustments to the model for better response to such guidance. Addressing this dependency, our work introduces the TinyAgent model, trained on a meticulously curated high-quality dataset. We also present the Collaborative Multi-Agent Tuning (CMAT) framework, an innovative system designed to augment language agent capabilities through adaptive weight updates based on environmental feedback. This framework fosters collaborative learning and real-time adaptation among multiple intelligent agents, enhancing their context-awareness and long-term memory. In this research, we propose a new communication agent framework that integrates multi-agent systems with environmental feedback mechanisms, offering a scalable method to explore cooperative behaviors. Notably, our TinyAgent-7B model exhibits performance on par with GPT-3.5, despite having fewer parameters, signifying a substantial improvement in the efficiency and effectiveness of LLMs.¹

∗ Equal contribution.
† Corresponding author: ty.shi@mail.utoronto.ca
¹ Access models, demos at https://github.com/heimy2000/CMAT

Figure 1: TinyAgent demonstrates outstanding performance, comparable to that of GPT-3.5. TinyAgent is a series of models fine-tuned based on Qwen (Bai et al., 2023) and Codellama (Roziere et al., 2023).

1 Introduction

In the rapid development of the field of artificial intelligence, large language models (LLMs) such as BERT and GPT-4 (OpenAI, 2023) have become important cornerstones of natural language processing (NLP). These models utilize the Transformer architecture and effectively capture long-distance dependencies through multi-head self-attention mechanisms, demonstrating strong capabilities across various NLP tasks. With technological advancements, the performance and application scope of LLMs continue to expand, promising significant improvements in computational efficiency and functionality, including anticipated advanced features such as self-improvement, self-checking, and sparse expert models (Liu et al., 2023).

However, it is noteworthy that the success of these models largely depends on human input to guide the correct dialogue. This dependency requires users to provide relevant and precise prompts based on their intentions and the feedback from the chat agent, raising a critical question: Can we replace human intervention with autonomous communication agents capable of steering conversations towards task completion with minimal human supervision?

Our research is driven by the need to overcome significant challenges faced by LLMs in real-world deployments, particularly the high computational resource requirements, data biases, and lack of robustness. These issues limit their applicability in resource-constrained environments and highlight the urgency of enhancing model efficiency and adaptability (Abid et al., 2021; Du et al., 2022). As demonstrated by Figure 1, we aim to address these limitations by optimizing models and training methods to enable smaller models to match the performance levels of larger models. Additionally, recognizing the potential of multi-agent systems (MAS) to improve processing efficiency and system adaptability through agent cooperation, we seek to develop a collaborative agent framework. This framework aims to facilitate effective cooperation among agents, thereby overcoming the performance gap and propelling further research and innovation in the field of LLMs (Ferry et al., 2018; Talwar et al., 2005). In our experiments, we evaluated the capabilities of large models with and without the use of prompts and observed that low-quality prompts can significantly degrade model performance. Consequently, we propose the Collaborative Multi-Agent Tuning (CMAT) framework.

The CMAT framework introduces a structured environment where individual agents, each with specialized roles and capabilities, work together to process information, make decisions, and solve complex tasks (Hernández-Orallo et al., 2017). By sharing insights and learning from interactions within this multi-agent ecosystem, the framework allows for a more scalable and flexible approach to training LLMs (Lewis et al., 2017). This collaborative effort not only helps in bridging the gap in performance between smaller and larger models but also fosters a more resilient system capable of adapting to new challenges without extensive human intervention (Kaplan et al., 2020). Through CMAT, we aim to push the boundaries of what is possible with LLMs, making them more accessible and effective for a wider range of applications (Rajpurkar et al., 2018).

The main contributions of our work are as follows:

• We propose the CMAT framework, which represents an innovative approach that allows for dynamic and real-time memory updates within multi-agent systems.

• We design a novel role-playing mechanism for precise task allocation and enhanced agent communication, significantly boosting overall performance and cooperation.

• We evaluated the fine-tuned TinyAgent models across multiple agent tasks, finding that in certain scenarios, their performance rivals that of advanced LLMs like GPT-4 and AgentLM (Zeng et al., 2023), demonstrating the potential efficiency and capabilities of compact models.

2 Related Work

2.1 LLMs Applications in a Multi-Agent Framework

We explore the applications of LLMs within multi-agent systems, highlighting their role versatility as users, assistants, and checkers, and their capability to offer bespoke support and solutions across such environments (de Zarzà et al., 2023; Talebirad and Nadiri, 2023). LLMs showcase remarkable adaptability to tasks through methods like supervised fine-tuning and real-time feedback learning, notably in tasks that require a sophisticated understanding and execution related to operating systems or databases (Christianos et al., 2023; Li et al., 2023). Furthermore, LLMs are adept at enhancing communication and collaboration among agents, a critical component for addressing complex issues that necessitate multi-role coordination (Zhao et al., 2021). Nevertheless, LLMs encounter specific challenges within multi-agent frameworks, especially in situations that demand a nuanced contextual comprehension and sustained memory retention, as well as adapting to fast-evolving environments and unforeseeable tasks (Diallo et al., 2020). Issues such as data bias, security concerns, and the intricacies of crafting effective protocols for multi-agent cooperation stand as significant hurdles in this domain (Zhang et al., 2017; García et al., 2015). Thus, by summarizing LLMs' roles in multi-agent frameworks, we underscore the critical need for continued innovation and research exploration, aimed at overcoming these technological hurdles and leveraging the full potential of LLMs
Figure 2: In the CMAT framework, the user assigns tasks to an assistant, which generates SQL commands based on
short-term and long-term memories: short-term memory provides immediate context from trajectory history, while
self-reflective outputs are stored as long-term memory. The checker verifies the correctness of SQL commands before
they are executed in the environment.

in complex systems (Lu and Zhang, 2020).

To enhance the adaptability and collaborative capabilities of LLMs in multi-agent systems, we've implemented memory modes, including long-term support and short-term memory with environmental feedback (Liang et al., 2016). This allows LLMs to better interact, learn, and adapt in dynamic environments, leveraging past experiences and responding to changes swiftly.

2.2 The tuning method for LLMs

The main tuning methods include supervised fine-tuning and reinforcement learning (Ouyang et al., 2022). Supervised fine-tuning enhances performance by training models on specific task datasets, and is especially suitable for tasks such as natural language understanding (NLU) (Howard and Ruder, 2018). On the other hand, reinforcement learning, guided by reward mechanisms, is suitable for handling complex and variable tasks (Mnih et al., 2015). The effective combination of these two methods can significantly improve the performance of LLMs in various tasks. Notably, LLMs of reduced scale, such as those encompassing 1.8 billion parameters, can achieve performance levels akin to those of models with greater parameter counts, like 6 billion parameters, when supported by high-quality datasets (Stiennon et al., 2020). This demonstrates that excellent data quality and appropriate tuning strategies play a decisive role in the performance of LLMs. Therefore, investing efforts in improving data quality and choosing the right tuning methods is essential for achieving optimal performance of LLMs in various application scenarios (Howard and Ruder, 2018). Through our work combining supervised fine-tuning with reinforcement learning, we've notably advanced LLM performance across a spectrum of tasks, showcasing significant improvements in task-specific benchmarks (Ouyang et al., 2022).

3 Methodology

Our work focuses on the design and implementation of a multi-agent LLM tuning framework, enhancing decision-making quality, controllability, and efficiency in complex systems through collaborative communication and task completion among different agent roles (i.e., User, Assistant, Checker) (Lu et al., 2022). It emphasizes the use of supervised fine-tuning techniques like LoRA (Hu et al., 2021) and P-Tuning (Lester et al., 2021) based on pre-trained datasets such as AgentBench, and further tuning through environment interaction and memory updates, akin to the process of Reinforcement Learning from Human Feedback (RLHF) (Vázquez-Canteli and Nagy, 2019).

Figure 3: Comparative study of Llama-2-7b and TinyAgent-7b in DataBase cases. (1) In DataBase tasks with a reflection mechanism, Llama-2-7b still made errors after reflection, while TinyAgent-7b adjusted its operations after reflecting on its first failed attempt. (2) Without a reflection mechanism, TinyAgent-7b repeated the same operation and ultimately failed to complete the task.

3.1 Actor-Evaluator Dynamics in the CMAT Framework

In multi-agent systems, the Actor plays a crucial role in behavior generation through LLMs. The Actor responds to environmental state observations to generate text and actions, while also using reflection to assess and adjust its behavior (Wang and Shi, 2019). In traditional reinforcement learning (RL) frameworks, the Actor decides on actions based on a specific policy and receives information about the current state of the environment. In the CMAT framework, the Actor model utilizes approaches like Chain of Thought and ReAct, enhancing the traditional text generation models to delve into different facets of behavior generation.

In the CMAT framework, as shown in Algorithm 1, the "Checker" plays a vital role, specifically tasked with verifying the correctness of the outputs produced by the Actor. It conducts a thorough evaluation process to ensure that the generated outputs meet the standards and requirements of the specific task. This role, by guaranteeing the accuracy and quality of the outputs, plays a critical role in overseeing and ensuring the overall performance quality of the
system. Through this mechanism, the Checker provides essential feedback and guidance to the Actor's behavior generation process, aiding in the continuous optimization and adjustment of strategies for more efficient and accurate decision-making outputs.

Algorithm 1 CMAT Framework
1: Initialize: LLMs, User, Assistant, Checker, task set T, policy π0, memory mem, Actor θactor, Critic θcritic.
2: for B ∈ T do
3:   Execute Task: Assign Assistant and Checker.
4:   while not complete(B) do
5:     a ← Action from Assistant via LLMs; Execute a for (s′, r).
6:     if Checker verifies a then
7:       Update s′, mem += (s′, r).
8:     else
9:       Adjust π, LLMs; retry B.
10:    end if
11:  end while
12:  Policy Update: Use Actor-Critic method to refine π with feedback.
13:  Calculate TD error δt = r + γV(st+1) − V(st).
14:  Update Actor θactor using gradient ascent: θactor ← θactor + α∇θactor log π(a|s) δt.
15:  Update Critic θcritic using TD error: θcritic ← θcritic + β δt ∇θcritic V(s).
16:  Update Checker Strategy: Revise the strategy of the Checker to align with the updated policy π and the latest environmental states and predictions.
17: end for
18: Complete: Verify all T.
19: Output: Final states, evaluations.
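Lines 13-15 of Algorithm 1 are a standard one-step actor-critic update. The following minimal Python sketch shows those three steps in isolation; the tabular softmax policy and state-value table are our own illustrative assumptions (in CMAT the actor and critic are realized by the language model and its tuning), not the paper's implementation.

import math
import random

class ActorCritic:
    """One-step actor-critic update mirroring lines 13-15 of Algorithm 1."""

    def __init__(self, actions, alpha=0.01, beta=0.1, gamma=0.99):
        self.actions = actions
        self.alpha, self.beta, self.gamma = alpha, beta, gamma
        self.theta = {}   # actor preferences theta[(state, action)]
        self.value = {}   # critic state values V(state)

    def policy(self, s):
        prefs = [math.exp(self.theta.get((s, a), 0.0)) for a in self.actions]
        z = sum(prefs)
        return [p / z for p in prefs]

    def act(self, s):
        return random.choices(self.actions, weights=self.policy(s))[0]

    def update(self, s, a, r, s_next, done):
        # Line 13: TD error  delta_t = r + gamma * V(s_{t+1}) - V(s_t)
        v_next = 0.0 if done else self.value.get(s_next, 0.0)
        delta = r + self.gamma * v_next - self.value.get(s, 0.0)
        # Line 14: actor ascent  theta <- theta + alpha * grad log pi(a|s) * delta
        probs = dict(zip(self.actions, self.policy(s)))
        for b in self.actions:
            grad_log = (1.0 if b == a else 0.0) - probs[b]
            self.theta[(s, b)] = self.theta.get((s, b), 0.0) + self.alpha * grad_log * delta
        # Line 15: critic step  V(s) <- V(s) + beta * delta
        self.value[s] = self.value.get(s, 0.0) + self.beta * delta
        return delta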

3.2 Real-Time Review Loop

Long-term memory

In the context of LLMs used within multi-agent systems, the significance of long-term memory is paramount (Kato et al., 2022). It acts not merely as an information store but as a comprehensive knowledge management system that enables LLMs to store and access crucial data over time (Bhadra, 2022). This is vital for maintaining a dynamic and coherent context, especially when managing the intricate interactions and decisions in multi-agent environments. Long-term memory enhances LLMs' decision-making by allowing them to draw on past experiences when facing new challenges, thereby improving both their adaptability and response speed. For example, by recalling relevant past interactions, LLMs can develop more precise strategies for new tasks (Penta, 2020).

Short-Term Memory and Environmental Feedback

Short-term memory, in contrast, focuses on the immediate context and the rapid processing of new information (Pae et al., 2012; Liu and Guo, 2019). This is particularly vital in dynamic environments where conditions and requirements can shift quickly (Martin, 1993). Short-term memory enables LLMs to adapt to these changes effectively, ensuring timely and relevant responses to new challenges or tasks. Environmental communication feedback mechanisms complement these memory modes by providing real-time inputs and interactions within the multi-agent framework (Yogatama et al., 2021). This continuous flow of information helps LLMs to adjust their strategies based on the latest data, facilitating a more fluid and responsive adaptation process. The feedback loop is essential for refining the actions and decisions of LLMs, allowing for incremental learning and optimization based on immediate outcomes and environmental cues (Davelaar et al., 2005).
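As Figure 2 and this subsection describe, short-term memory holds the immediate trajectory while self-reflective outputs are persisted as long-term memory. A minimal sketch of such a two-tier store follows; the class and method names are our own illustration, not the released code.

from collections import deque
from dataclasses import dataclass, field

@dataclass
class AgentMemory:
    """Bounded short-term trajectory buffer plus a persistent store of reflections."""
    short_capacity: int = 20
    short_term: deque = field(default_factory=deque)   # (state, action, reward) steps
    long_term: list = field(default_factory=list)      # distilled self-reflections

    def observe(self, state, action, reward):
        # Short-term memory keeps the immediate trajectory context.
        self.short_term.append((state, action, reward))
        while len(self.short_term) > self.short_capacity:
            self.short_term.popleft()

    def reflect(self, summary: str):
        # Self-reflective outputs (e.g. "the WHERE clause used the wrong column")
        # are stored as long-term memory for reuse on later tasks.
        self.long_term.append(summary)

    def context(self, query: str, k: int = 3) -> str:
        # Build a prompt context: recent trajectory plus the k reflections that
        # best match the query (naive keyword overlap stands in for retrieval).
        recent = "\n".join(f"{s} -> {a} (r={r})" for s, a, r in self.short_term)
        scored = sorted(self.long_term,
                        key=lambda m: len(set(m.split()) & set(query.split())),
                        reverse=True)
        return recent + "\n" + "\n".join(scored[:k])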
The Reflexion Process

Artificial intelligence systems are increasingly adopting self-reflection mechanisms to enhance their decision-making capabilities. By systematically reviewing past actions and their outcomes, these systems can identify patterns and make informed adjustments to their strategies.

The CMAT framework introduces a Reflexion process of collaboration between computation and cognition by initializing LLMs, users, assistants, checkers, and a set of tasks B (Li et al., 2018). Each task B is executed by an assistant and verified by a checker. The assistant selects an action a through interaction with LLMs, obtaining a new state s′ and reward r (Che et al., 2021). If the checker verifies a as correct, the system updates the state to s′ and stores
Table 1: Evaluation of Code Correction

Model BLEU-4 ROUGE-1 ROUGE-2 ROUGE-L


codellama-7b 25.01 45.91 29.83 26.24
codellama-13b 26.96 45.31 29.54 25.91
tinyllama-1.8b 43.38 59.86 37.81 42.86

(s′, r) in memory mem; otherwise, it adjusts strategy π and retries. The entire process not only involves confirmation of correct actions but also includes updating the strategy π based on feedback from the checker, creating a continuous cycle of learning and adjustment. Each iteration aims to optimize the assistant's decision-making strategy through practice and reflection, enhancing the efficiency and accuracy of task execution. Once all tasks are verified, it outputs the final states and evaluation results of all tasks (Silver et al., 2017). Within this cycle, there is also a critical step: updating the checker's strategy. This means that after each update of the strategy π, not only does the assistant's strategy need to be adjusted based on feedback, but the checker's verification strategy also needs to be updated synchronously to ensure its decision-making logic is consistent with the latest environmental state and strategy. Such updates ensure the checker can accurately assess the assistant's actions, further enhancing the system's overall performance and reliability. Through this approach, the CMAT framework achieves a deeper level of collaboration and self-optimization between computation and cognition, enabling AI systems to face new challenges and environments with higher efficiency and adaptability.
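A condensed sketch of this execute-verify-reflect cycle is given below. It reuses the AgentMemory sketch above; the helpers assistant_propose, checker_verify, and reflect are placeholders for the underlying LLM calls, and the task object (initial_state, goal, step) is likewise assumed for illustration.

def run_task(task, assistant_propose, checker_verify, reflect, memory, max_turns=10):
    """Execute one task with checker-in-the-loop verification: commit verified
    actions to short-term memory, otherwise store a reflection and retry."""
    state = task.initial_state
    for _ in range(max_turns):
        action = assistant_propose(state, memory.context(task.goal))
        ok, feedback = checker_verify(task, state, action)
        if ok:
            next_state, reward, done = task.step(action)   # execute in the environment
            memory.observe(state, action, reward)          # short-term update
            state = next_state
            if done:
                return True
        else:
            memory.reflect(reflect(state, action, feedback))  # long-term update
    return False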

Checker-In-The-Loop

To enhance the controllability of the role-playing framework, we have introduced an inspector agent that can inspect the role-playing agents (Tao et al., 2023) for any issues. This enables a decision-making process for task solving similar to tree search. In practice, the inspector can be either an AI agent or a human.

3.3 Compared to Existing Practice

Our method stands out by dynamically updating memory through real-time interactions, enabling rapid adaptation to new experiences. It uses direct feedback from the environment for quick learning and incorporates both short-term and long-term memory updates for efficient decision-making. Our approach, enhanced with self-reflection and experience replay, offers deeper understanding and better response to complex scenarios, leading to more precise and rational decisions in changing conditions.

4 Experiments

Our evaluation framework rigorously tests intelligent agents in six key domains to ensure their readiness for diverse real-world challenges (Ross et al., 2023). These areas include seamless LLM integration into OS with an emphasis on security and user interaction; proficiency in real DB operations using SQL (Halevy et al., 2004); task execution on the simulated e-commerce platform WebShop (WS) (Yao et al., 2022); constructing and using KGs for enhanced semantic understanding; employing the M2W dataset for complex web tasks, marking the first dataset for developing general web agents following language instructions; and applying abstract reasoning and visual tasks in the text-based ALFWorld (ALF) (Shridhar et al., 2021). For more implementation and evaluation details, see Appendices A and B.

4.1 Dataset

The dataset for our research was meticulously constructed to comprehensively evaluate the capabilities of agents (Gou et al., 2020). It was established through self-collected methods, aimed at providing a rich and diverse testing environment to thoroughly assess the performance of deep learning models across various tasks (Sachdeva and McAuley, 2023). The construction of the dataset included key processes such as data collection, filtering, enhancement, and knowledge distillation (Chen and Liu, 2018). Through detailed screening and processing, we ensured the accuracy and consistency of the dataset, retaining only high-quality samples directly related
Table 2: Test set results of AGENTBENCH. Comparison between API-based models and open-source models. Bold:
The best among API-based and open-source models.

LLM Type   Models                                VER       OS     DB     KG     ALF    WS     M2W

API        gpt-3.5-turbo                         0613      31.6   15.7   25.9   16.0   64.1   16.0
API        gpt-4                                 0613      42.4   32.0   58.8   78.0   61.6   29.0
API        text-davinci-003                      -         20.1   16.3   34.9   20.0   61.7   26.0
API        text-davinci-002                      -         8.3    16.7   41.5   16.0   56.3   9.0
OSS        tinyllama-1.1b (Zhang et al., 2024)   -         2.8    0.0    0.0    0.0    0.0    0.0
OSS        opt-1.3b (Zhang et al., 2022)         -         0.7    0.0    0.0    0.0    0.0    0.0
OSS        opt-2.7b                              -         1.4    0.0    0.0    0.0    0.0    0.0
OSS        qwen-1.8b                             chat      10.4   22.67  6.8    0.0    26.6   5.0
OSS        chatglm2-6b                           v1.1      4.2    1.3    0.0    0.0    0.0    0.0
OSS        codellama-7b                          instruct  9.7    2.7    0.0    0.0    14.3   5.0
OSS        llama2-7b (Touvron et al., 2023)      chat      0.0    4.2    8.0    0.0    11.6   7.0
OSS        zephyr-7b (Tunstall et al., 2023)     alpha     12.5   9.7    5.0    8.0    45.0   11.0
OSS        baichuan2-6b (Yang et al., 2023)      chat      2.8    9.7    0.0    0.0    6.1    11.0
OSS        mpt-7b                                chat      5.6    9.7    12.7   0.0    0.0    0.0
OSS        qwen-7b                               chat      12.5   13.0   7.0    34.3   0.0    0.0
OSS        agentlm-7b                            chat      14.6   33.0   9.0    16.4   18.4   10.0
OSS        agentlm-7b (SFT)                      chat      17.4   37.0   10.0   17.4   26.6   10.0
OSS        tinyagent-1.8b                        chat      17.7   28.33  48.0   6.0    32.7   11.0
OSS        tinyagent-7b                          chat      23.1   41.3   28.0   8.0    58.7   12.0

to the testing objectives (Sachdeva and McAuley, 2023). Faced with issues of data imbalance and insufficient samples, we utilized data augmentation and knowledge distillation techniques. Knowledge distillation helped us to extract the most valuable and representative information from the vast amount of collected data, thus building an efficient and refined testing dataset. This process significantly improved the quality and applicability of the dataset, providing a solid foundation for evaluating the capabilities of model agents (Mishra and Marr, 2017).

4.2 Evaluating Code Correction

As shown in Table 1, in this study we conducted a comprehensive performance evaluation of TinyAgent-1.8B and the CodeLlama series models (CodeLlama-7B and CodeLlama-13B), aiming to explore their multi-task checking capabilities, including but not limited to code correction, OS configuration, DB query optimization, and WS. The experimental results showed that TinyAgent-1.8B demonstrated a significant advantage in cross-task performance evaluation compared to the CodeLlama series models. This performance was not only significant in code correction tasks but also prominent in other checking tasks such as OS configuration, DB query optimization, and WS management. These findings highlight that TinyAgent-1.8B not only possesses efficient code analysis capabilities but is also widely applicable to the inspection and optimization of other complex systems.
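Table 1 reports BLEU-4 and ROUGE scores for the corrected code. A minimal sketch of how such scores can be computed is shown below using the nltk and rouge-score packages; the paper does not specify its exact tooling, so treat this as illustrative only.

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

def score_correction(reference: str, candidate: str) -> dict:
    """Compute BLEU-4 and ROUGE-1/2/L for one corrected code sample."""
    smooth = SmoothingFunction().method1
    bleu4 = sentence_bleu([reference.split()], candidate.split(),
                          weights=(0.25, 0.25, 0.25, 0.25),
                          smoothing_function=smooth)
    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"])
    rouge = scorer.score(reference, candidate)
    return {
        "BLEU-4": 100 * bleu4,
        "ROUGE-1": 100 * rouge["rouge1"].fmeasure,
        "ROUGE-2": 100 * rouge["rouge2"].fmeasure,
        "ROUGE-L": 100 * rouge["rougeL"].fmeasure,
    }

# Example: compare a model's fix against the reference fix.
print(score_correction("for i in range(10): print(i)",
                       "for i in range(10): print(i)"))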
4.3 Baselines

In the baseline section of our study, we've selected Qwen-1.8B and CodeLlama-7B as pivotal benchmarks to assess the TinyAgent series' performance, excluding the CMAT framework's influence.

4.4 Results analysis

The results in Table 2 underscore the effectiveness of our fine-tuning methods, especially for the TinyAgent models. TinyAgent-1.8B demonstrates significant performance in the KG task, on par with advanced models like GPT-3.5. TinyAgent-7B also showcases its strengths, notably in the DB task, where it surpasses its foundational model (Antonello
Table 3: Ablation study on the effect of agent and general instructions.

Models OS DB KG ALF WS M2W


tinyagent-7b 27.3 43.0 38.0 10.0 61.8 14.0
- agent only 20.1 39.3 25.0 2.0 55.7 7.0
- general only 9.7 5.4 0.0 0.0 26.6 5.0

et al., 2020), CodeLlama-7B, and offers competitive scores against GPT-4. These findings indicate the TinyAgent models' capacity to match or even surpass models with larger parameters in certain aspects. Moreover, the CMAT framework's potential to enhance the capabilities of smaller-scale models is highlighted, allowing the TinyAgent models to closely compete with the performance of advanced models such as GPT-4.

As illustrated in Figure 1, our comparative analysis indicates that TinyAgent models, refined from Qwen-1.8B and CodeLlama-7B, exhibit superior performance to their base models. The incorporation of the CMAT framework further amplifies their functionality, equipping these small models to match the capabilities of GPT-3.5. This performance boost is credited to CMAT's optimization of model interactions and its strategic use of memory modes for specific tasks, confirming its effectiveness in enhancing the sophistication of fine-tuned models (Deshpande et al., 2021).

4.5 Error analysis

In our testing framework's error analysis, we observed common challenges in DB tasks faced by models, such as difficulties in understanding user requests, executing actions, and pre-action problem analysis. Many models simply respond with "OK" to specific instructions without performing actual SQL operations, indicating a gap in transforming user requests into database actions. Models often provide superficial acknowledgments without delivering precise execution or in-depth problem analysis, failing to meet user expectations. In contrast, the TinyAgent series excels in understanding and converting user requests into actual SQL operations, effectively comprehending and executing tasks. It provides clear responses and adheres to user-specified SQL formats, fulfilling user expectations comprehensively. Additionally, TinyAgent's thorough pre-action problem analysis and reflection demonstrate its advanced problem-solving skills and deep understanding of issues.

4.6 Ablation Study

Table 3 presents an ablation study on the TinyAgent-7B model, delineating the impact of agent-specific and general instructions on task performance. The composite model, TinyAgent-7B, demonstrates the highest efficacy, notably in WS and DB tasks, which implies its adeptness in handling complex e-commerce interactions and database management. The agent-only variant exhibits a decline in performance, suggesting that while task-specific instructions are crucial, they are not wholly sufficient for the breadth of tasks such as KG. The general-only model's performance is considerably reduced across all tasks, with a complete inability to perform in KG and ALF, highlighting the indispensability of agent-specific instructions. This data underscores the necessity of integrating both agent-specific and general instructions to enhance the versatility and effectiveness of AI models in diverse task domains.

5 Conclusions

The main findings of our work reveal that carefully trained small-parameter models on excellent datasets can achieve performance comparable to that of large-parameter models. With the application of the CMAT framework, we further demonstrate the significant potential for performance improvement in large-parameter models, highlighting the importance of model design and optimization strategies for parameter size. In our evaluation, although most open-source LLMs performed poorly compared to API-provided models without optimization, some models displayed similar capabilities to API models after meticulous fine-tuning of the TinyAgent model. This finding emphasizes not only the importance of
parameter size in handling real-world environmental interactions but also showcases the enormous potential of even smaller models through the CMAT framework and precise adjustment strategies.

Limitations

In this study, we demonstrated the potential for performance improvement by applying the CMAT framework to TinyAgent series models and other large language models (LLMs). However, there are clear limitations to the research: First, although most models showed improved performance, some models saw limited improvement due to weaker base agent capabilities, indicating that the effectiveness of the CMAT framework might vary significantly between different models; second, the limitations of datasets and task types could affect the broad applicability of the conclusions, while low-quality datasets could negatively impact model performance; lastly, although evaluations based on AgentBench ensured fairness, they might not fully reflect the complexity of real-world scenarios, and due to computational resource constraints, larger-scale models could not be tested. This underscores the importance of future work to consider a wider range of models, datasets, and task types, especially the implementation of optimization strategies and framework applications in resource-constrained situations.
References

Abubakar Abid, Maheen Farooqi, and James Zou. 2021. Large language models associate muslims with violence. Nature Machine Intelligence, 3(6):461–463.
Richard J. Antonello, Javier Turek, and Alexander G. Huth. 2020. Selecting informative contexts improves language model fine-tuning. ArXiv, abs/2005.00175.
Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. 2023. Qwen technical report. arXiv preprint arXiv:2309.16609.
Yalong Bai, Kuiyuan Yang, Wei Yu, Chang Xu, Wei-Ying Ma, and T. Zhao. 2015. Automatic image dataset construction from click-through logs using deep neural network. Proceedings of the 23rd ACM International Conference on Multimedia.
Sushovan Bhadra. 2022. A stochastic petri net model of continuous integration and continuous delivery. 2022 IEEE International Symposium on Software Reliability Engineering Workshops (ISSREW), pages 114–117.
Chunjiang Che, Xiaoli Li, Chuan Chen, Xiaoyu He, and Zibin Zheng. 2021. A decentralized federated learning framework via committee mechanism with convergence guarantee. IEEE Transactions on Parallel and Distributed Systems, 33:4783–4800.
C. L. P. Chen and Zhulin Liu. 2018. Broad learning system: An effective and efficient incremental learning system without the need for deep architecture. IEEE Transactions on Neural Networks and Learning Systems, 29:10–24.
Filippos Christianos, Georgios Papoudakis, Matthieu Zimmer, Thomas Coste, Zhihao Wu, Jingxuan Chen, Khyati Khandelwal, James Doran, Xidong Feng, Jiacheng Liu, et al. 2023. Pangu-agent: A fine-tunable generalist agent with structured reasoning. arXiv preprint arXiv:2312.14878.
E. Davelaar, Y. Goshen-Gottstein, Amir Ashkenazi, H. Haarmann, and M. Usher. 2005. The demise of short-term memory revisited: empirical and computational investigations of recency effects. Psychological Review, 112(1):3–42.
I. de Zarzà, J. de Curtò, Gemma Roig, Pietro Manzoni, and Carlos T. Calafate. 2023. Emergent cooperation and strategy adaptation in multi-agent systems: An extended coevolutionary theory with llms. Electronics, 12(12):2722.
A. Deshpande, A. Achille, Avinash Ravichandran, Hao Li, L. Zancato, Charless C. Fowlkes, Rahul Bhotika, Stefano Soatto, and P. Perona. 2021. A linearized framework and a new benchmark for model selection for fine-tuning. ArXiv, abs/2102.00084.
Elhadji Amadou Oury Diallo, Ayumi Sugiyama, and T. Sugawara. 2020. Coordinated behavior of cooperative agents using deep reinforcement learning. Neurocomputing, 396:230–240.
Mengnan Du, Fengxiang He, Na Zou, Dacheng Tao, and Xia Hu. 2022. Shortcut learning of large language models in natural language understanding: A survey. arXiv preprint arXiv:2208.11857.
Alexander Dunn, John Dagdelen, Nicholas Walker, Sanghoon Lee, Andrew S. Rosen, Gerbrand Ceder, Kristin Persson, and Anubhav Jain. 2022. Structured information extraction from complex scientific text with fine-tuned large language models. arXiv preprint arXiv:2212.05238.
N. Ferry, F. Chauvel, Hui Song, A. Rossini, Maksym Lushpenko, and Arnor Solberg. 2018. Cloudmf: Model-driven management of multi-cloud applications. ACM Trans. Internet Techn., 18:16:1–16:24.
Eloy García, Yongcan Cao, and D. Casbeer. 2015. Periodic event-triggered synchronization of linear multi-agent systems with communication delays. IEEE Transactions on Automatic Control, 62:366–371.
Jianping Gou, B. Yu, S. Maybank, and D. Tao. 2020. Knowledge distillation: A survey. International Journal of Computer Vision, 129:1789–1819.
A. Halevy, Z. Ives, J. Madhavan, P. Mork, Dan Suciu, and I. Tatarinov. 2004. The piazza peer data management system. IEEE Transactions on Knowledge and Data Engineering, 16:787–798.
José Hernández-Orallo, Marco Baroni, Jordi Bieger, Nader Chmait, David L. Dowe, Katja Hofmann, Fernando Martínez-Plumed, Claes Strannegård, and Kristinn R. Thórisson. 2017. A new ai evaluation cosmos: Ready to play the game? AI Magazine, 38(3):66–69.
Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning for text classification. pages 328–339.
Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2021. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685.
Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 2020. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361.
Daiju Kato, Ayumu Shimizu, and Hiroshi Ishikawa. 2022. Quality classification for testing work in devops. In Proceedings of the 14th International Conference on Management of Digital EcoSystems, pages 156–162.
Brian Lester, Rami Al-Rfou, and Noah Constant. 2021. The power of scale for parameter-efficient prompt tuning. arXiv preprint arXiv:2104.08691.
Mike Lewis, Denis Yarats, Yann N. Dauphin, Devi Parikh, and Dhruv Batra. 2017. Deal or no deal? end-to-end learning for negotiation dialogues. arXiv preprint arXiv:1706.05125.
Guanbin Li, Yukang Gan, Hejun Wu, Nong Xiao, and Liang Lin. 2018. Cross-modal attentional context learning for rgb-d object detection. IEEE Transactions on Image Processing, 28:1591–1601.
Guohao Li, Hasan Abed Al Kader Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. 2023. Camel: Communicative agents for "mind" exploration of large scale language model society. arXiv preprint arXiv:2303.17760.
Xin Liang, G. Shen, and Shanshan Bu. 2016. Multiagent systems in construction: A ten-year review. Journal of Computing in Civil Engineering, 30:04016016.
Gang Liu and Jiabao Guo. 2019. Bidirectional lstm with attention mechanism and convolutional layer for text classification. Neurocomputing, 337:325–338.
Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, et al. 2023. Agentbench: Evaluating llms as agents. arXiv preprint arXiv:2308.03688.
Kai Lu and Chongyang Zhang. 2020. Blockchain-based multiparty computation system. In 2020 IEEE 11th International Conference on Software Engineering and Service Science (ICSESS), pages 28–31. IEEE.
Zehui Lu, Wanxin Jin, S. Mou, and B. Anderson. 2022. Cooperative tuning of multi-agent optimal control systems. 2022 IEEE 61st Conference on Decision and Control (CDC), pages 571–576.
R. Martin. 1993. Short-term memory and sentence processing: Evidence from neuropsychology. Memory and Cognition, 21:176–183.
Asit K. Mishra and Debbie Marr. 2017. Apprentice: Using knowledge distillation techniques to improve low-precision network accuracy. ArXiv, abs/1711.05852.
Volodymyr Mnih, K. Kavukcuoglu, David Silver, Andrei A. Rusu, J. Veness, Marc G. Bellemare, A. Graves, Martin A. Riedmiller, A. Fidjeland, Georg Ostrovski, Stig Petersen, Charlie Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, D. Kumaran, Daan Wierstra, S. Legg, and D. Hassabis. 2015. Human-level control through deep reinforcement learning. Nature, 518:529–533.
OpenAI. 2023. Gpt-4 technical report.
Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, J. Schulman, Jacob Hilton, Fraser Kelton, Luke E. Miller, Maddie Simens, Amanda Askell, P. Welinder, P. Christiano, J. Leike, and Ryan J. Lowe. 2022. Training language models to follow instructions with human feedback. ArXiv, abs/2203.02155.
Hye K. Pae et al. 2012. Linguistic relativity revisited: The interaction between l1 and l2 in thinking, learning, and production. Psychology, 3(01):49.
M. Di Penta. 2020. Understanding and improving continuous integration and delivery practice using data from the wild. Proceedings of the 13th Innovations in Software Engineering Conference (Formerly known as India Software Engineering Conference).
Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know what you don't know: Unanswerable questions for squad. arXiv preprint arXiv:1806.03822.
Steven I. Ross, Fernando Martinez, Stephanie Houde, Michael J. Muller, and Justin D. Weisz. 2023. The programmer's assistant: Conversational interaction with a large language model for software development. ArXiv, abs/2302.07080.
Baptiste Roziere, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, et al. 2023. Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950.
Noveen Sachdeva and Julian McAuley. 2023. Data distillation: A survey. ArXiv, abs/2301.04272.
Mohit Shridhar, Xingdi Yuan, Marc-Alexandre Côté, Yonatan Bisk, Adam Trischler, and Matthew Hausknecht. 2021. ALFWorld: Aligning Text and Embodied Environments for Interactive Learning. In Proceedings of the International Conference on Learning Representations (ICLR).
David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. 2017. Mastering the game of go without human knowledge. Nature, 550(7676):354–359.
Nisan Stiennon, Long Ouyang, Jeff Wu, Daniel M. Ziegler, Ryan J. Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul Christiano. 2020. Learning to summarize from human feedback. ArXiv, abs/2009.01325.
Yashar Talebirad and Amirhossein Nadiri. 2023. Multi-agent collaboration: Harnessing the power of intelligent llm agents. arXiv preprint arXiv:2306.03314.
V. Talwar, Qinyi Wu, C. Pu, W. Yan, G. Jung, and D. Milojicic. 2005. Comparison of approaches to service deployment. 25th IEEE International Conference on Distributed Computing Systems (ICDCS'05), pages 543–552.
Meiling Tao, Xuechen Liang, Tianyu Shi, Lei Yu, and Yiting Xie. 2023. Rolecraft-glm: Advancing personalized role-playing in large language models. arXiv preprint arXiv:2401.09432.
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
Lewis Tunstall, Edward Beeching, Nathan Lambert, Nazneen Rajani, Kashif Rasul, Younes Belkada, Shengyi Huang, Leandro von Werra, Clémentine Fourrier, Nathan Habib, et al. 2023. Zephyr: Direct distillation of lm alignment. arXiv preprint arXiv:2310.16944.
José R. Vázquez-Canteli and Zoltán Nagy. 2019. Reinforcement learning for demand response: A review of algorithms and modeling techniques. Applied Energy, 235:1072–1089.
Guihong Wang and Jinglun Shi. 2019. Actor-critic for multi-agent system with variable quantity of agents. In IoT as a Service: 4th EAI International Conference, IoTaaS 2018, Xi'an, China, November 17–18, 2018, Proceedings 4, pages 48–56. Springer.
Jun Wang, Yong-Hong Sun, Z. Fan, and Yan Liu. 2005. A collaborative e-learning system based on multi-agent. pages 455–463.
Aiyuan Yang, Bin Xiao, Bingning Wang, Borong Zhang, Ce Bian, Chao Yin, Chenxu Lv, Da Pan, Dian Wang, Dong Yan, et al. 2023. Baichuan 2: Open large-scale language models. arXiv preprint arXiv:2309.10305.
Shunyu Yao, Howard Chen, John Yang, and Karthik Narasimhan. 2022. Webshop: Towards scalable real-world web interaction with grounded language agents.
Dani Yogatama, Cyprien de Masson d'Autume, and Lingpeng Kong. 2021. Adaptive semiparametric language models. Transactions of the Association for Computational Linguistics, 9:362–373.
Aohan Zeng, Mingdao Liu, Rui Lu, Bowen Wang, Xiao Liu, Yuxiao Dong, and Jie Tang. 2023. Agenttuning: Enabling generalized agent abilities for llms. arXiv preprint arXiv:2310.12823.
Peiyuan Zhang, Guangtao Zeng, Tianduo Wang, and Wei Lu. 2024. Tinyllama: An open-source small language model. arXiv preprint arXiv:2401.02385.
Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. 2022. Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068.
Wentao Zhang, Yang Liu, Jianquan Lu, and Jinde Cao. 2017. A novel consensus algorithm for second-order multi-agent systems without velocity measurements. International Journal of Robust and Nonlinear Control, 27:2510–2528.
Kai Zhao, Yongduan Song, CL Philip Chen, and Long Chen. 2021. Adaptive asymptotic tracking with global performance for nonlinear systems with unknown control directions. IEEE Transactions on Automatic Control, 67(3):1566–1573.
A Implementation Settings

In this paper, we describe an experiment conducted using the Low-Rank Adaptation (LoRA) fine-tuning method to enhance the performance of various models (Wang et al., 2005). The accuracy of the LoRA method is of paramount importance in dealing with personalized and emotionally rich content. It enables the models to adapt to new data features while maintaining their core capabilities (Bai et al., 2015). During the experiment, we set the temperature parameter of the test models to 0.7 to increase the diversity of the content, and adjusted the top-p value to 0.95 to improve the precision of the generated content. We employed a learning rate of 2e-4 and beta values of (0.9, 0.999) to ensure the stability of the training process. The batch size was set to 4, with gradient accumulation, to ensure efficiency within the limits of computational resources. To balance innovation and coherence, we used LoRA parameters with a rank of 8 and an alpha value of 32, and adjusted both the top-p value and the temperature parameter to 0.7. These adjustments significantly enhanced the models' flexibility and accuracy in handling personalized and emotionally rich content.
B Evaluation criteria

(1) Operating systems Integrating LLMs into operating systems offers vast potential for automating and optimizing tasks. This integration demands a secure, user-friendly interface for effective LLM-OS interaction and requires LLMs to accurately understand the OS context for informed operations. Ensuring the safety of these operations is paramount to prevent misuse. Moreover, the system must handle errors and provide clear feedback to users, enhancing interaction and control. Addressing these aspects can revolutionize computer interaction and efficiency across industries.

(2) Database (DB) Due to the crucial and challenging nature of database analysis in many daily affairs, it is paramount to examine the abilities of LLMs to operate on real databases via SQL. Previous research has placed significant emphasis on individual procedures, such as showcasing the effectiveness of LLMs in automating database access through T5QL, a new SQL generation method. Additionally, utilizing fine-tuned LLMs (such as GPT-3.5) to extract and link complex scientific information from scientific texts has demonstrated the capacity of LLMs to obtain structured knowledge from unstructured text and subsequently construct large databases (Dunn et al., 2022).

(3) WebShop represents an innovative simulation of an e-commerce website environment, featuring 1.18 million real-world products and 12,087 crowd-sourced text instructions. This platform challenges agents to navigate through multiple types of webpages and perform a variety of actions to find, customize, and purchase products according to given instructions. WebShop presents several challenges, including understanding compositional instructions, query (re-)formulation, dealing with noisy text in webpages, and conducting strategic exploration.

(4) Knowledge Graphs The utilization of LLMs in constructing and interacting with knowledge graphs (KG) offers a promising avenue for enhancing semantic understanding and information retrieval. This involves assessing the models' ability to not only generate but also interpret complex interrelations within data, facilitating more intuitive and context-aware responses. The effectiveness of LLMs in this domain could significantly improve AI's capacity for reasoning and decision-making based on structured knowledge.

(5) Mind2Web Mind2Web (M2W) is a dataset for developing web agents that perform complex tasks on real websites via language instructions. It features over 2,000 tasks across 137 sites from 31 domains. M2W's real web environments and diverse user interactions make it a crucial platform for advancing AI navigation capabilities.

(6) ALFWorld bridges interactive TextWorld environments with embodied tasks from the ALFRED dataset, enabling agents to learn abstract strategies and apply them to real-world tasks. It facilitates abstract reasoning and concrete exe-
Table 4: Distribution of various execution results across six tasks. (CLE: Exceeded Context Limit, TLE: Surpassed
Task Limit). Task limits exceeded are the main reason for incomplete tasks, pointing to limitations in LLM agents’
reasoning and decision-making within constrained timeframes.

OS DB KG ALF WS M2W
Completed 84.7 84.0 25.0 2.0 93.5 57.0
CLE 0.0 0.0 0.0 0.0 0.0 0.0
Invalid Format 0.0 3.0 0.0 0.0 0.0 0.0
Invalid Action 0.0 0.0 0.0 96.0 0.0 8.0
TLE 15.3 13.0 75.0 2.0 6.5 35.0

cution, allowing agents to plan actions in a text-based simulator and then execute these tasks in a visual environment. This approach enhances agent generalization and problem-solving skills across various domains, such as language understanding and visual navigation, by leveraging a modular design that simplifies research improvements.

C Details about Experiment

In our research, we focused on exploring the impact of meticulously crafted prompts on the performance of Natural Language Processing (NLP) models in code correction tasks, as shown in Table 5. By examining prompts of varying quality (high-quality, low-quality, and no prompts), we aimed to reveal how these factors influence the model's capability, especially in identifying and correcting programming errors. Our findings indicate that fine-tuned high-quality prompts significantly enhance the model's ability to write and rigorously check code, enabling it to produce high-quality code and conduct stricter code inspections, as demonstrated in Tables 23–25. In contrast, low-quality prompts might lead to the model generating nonsensical and ineffective code, and in situations where correct code is provided, the model might even alter it to incorrect code. The decline in model output quality in these instances is primarily due to the vague and unclear descriptions provided by the prompts and the lack of specific guidance, which prevents the model from effectively utilizing the knowledge it acquired during training to focus on key error points and make accurate correction decisions. Moreover, our study also showed that in the absence of any prompt guidance, the model's performance tends to be limited by the quality and scope of its training data, particularly when faced with novel or complex error types. This finding underscores the importance of high-quality prompts in designing effective NLP models for complex tasks like code correction. High-quality prompts not only guide the model to more accurately understand and address specific programming issues but also reduce noise in the processing, thereby improving overall efficiency and accuracy.

D Prompts for Tasks

We describe the task to the checker agent using the following instruction, as shown in Tables 6–12.

E Examples of Tasks

All tasks in the dataset are divided into six categories. For each category, we present an example of successful execution, as shown in Tables 13–22.
Table 5: Evaluation Metrics Results

Evaluation Method BLEU-4 ROUGE-1 ROUGE-2 ROUGE-L


prompt - High-quality 44.4 57.3 35.0 42.5
prompt - Low-quality 15.2 27.4 10.3 16.8
without prompts 26.8 47.2 30.2 26.7

Table 6: Prompt for DATABASE

DATABASE(DB)

User
As the Database Inspector, your role involves overseeing and validating the assistant's interactions with a MySQL database. The objective is a seamless collaboration to fulfill a specified task. The process is straightforward: the assistant will submit SQL queries, and it's your responsibility to examine the SQL syntax and logic to confirm their accuracy in addressing the given question.
The mission is to collaborate effectively. You leverage your expertise, guiding the assistant through the task. If the assistant deviates from the correct approach, you're expected to provide guidance, offering both your insight and the correct solution. Documenting your thought process is essential.
During each review cycle, you're given the option to act by providing a SQL command or conclude with an answer. Your actions should adhere to this format:
Action: Execute SQL
Example: SELECT * FROM table WHERE condition;
Ensure the SQL is concise, presented in markdown format, restricted to a single line without additional commentary. Only one SQL statement should be executed at a time.
The assistant will execute only the first SQL code block you provide and then share the output. Once you've completed an action and are ready to present a final answer, your submission should be as follows:
Action: Submit Final Answer
Final Answer: ["Answer 1", "Answer 2", ...]
Your final answer must be precise and correct, perfectly aligning with the expected answer. If the task involves modifying the database, the answer field might vary post-operation. Nonetheless, any departure from the specified response format will lead to an immediate failure of the task.
It's important to remember that you will be presented with the raw MySQL response to analyze independently. Accuracy and correctness are crucial in this joint effort.
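The protocol above executes only the first SQL code block and allows one statement per turn. A small sketch of that extraction-and-sanity-check step is shown below; it uses Python's standard sqlite3 module purely as a stand-in for the actual MySQL backend, which is an assumption for illustration.

import re
import sqlite3

SQL_BLOCK = re.compile(r"```(?:sql)?\s*(.*?)```", re.DOTALL | re.IGNORECASE)

def first_sql_statement(reply: str):
    """Return the first SQL code block from a checker/assistant reply."""
    match = SQL_BLOCK.search(reply)
    if not match:
        return None
    # Keep only the first statement, mirroring "one SQL statement at a time".
    return match.group(1).strip().split(";")[0] + ";"

def dry_run(sql: str, conn: sqlite3.Connection):
    """Syntax-check a statement with EXPLAIN before it touches the real data."""
    try:
        conn.execute("EXPLAIN " + sql)
        return True, "ok"
    except sqlite3.Error as exc:
        return False, str(exc)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
reply = "Action: Execute SQL\n```sql\nSELECT name FROM users WHERE id = 1;\n```"
sql = first_sql_statement(reply)
print(sql, dry_run(sql, conn))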
Table 7: Prompt for KNOWLEDGE GRAPH

KNOWLEDGE GRAPH(KG)

User
As a reviewer , your task is to verify that the system for answering
questions based on a knowledge base ( KB ) operates correctly . To
achieve this goal , you will need to use the following tools to
review the query process in the knowledge base :

1. Verify Relation Retrieval ( get_relations ( variable : var ) -> list


of relations )
Confirm whether the system can correctly return a list of direct
relations associated with a specified variable . The variable
can be either a single entity or a set of entities ( i . e . , the
result of a previous query ) . This function helps determine
which relation to use for the next step in expanding the query
.
Example : Verify if ‘ get_relations ( Barack Obama ) ‘ can find all
relations / edges starting from the entity Barack Obama .
Note : The argument for ‘ get_relations ‘ must be a clearly defined
entity or a variable obtained from a previous query ( such as
#0) .

2. Verify Neighbor Retrieval ( get_neighbors ( variable : var , relation :


str ) -> variable )
Validate whether the system can return all entities connected to
the given variable via the specified relation . Note that ‘
get_neighbors () ‘ can only be used after ‘ get_relations () ‘ is
used to find a set of viable relations .
Example : Verify if ‘ get_neighbors ( Barack Obama , people . person .
profession ) ‘ correctly returns Obama ’ s profession in Freebase .

3. Verify Intersection Calculation ( intersection ( variable1 : var ,


variable2 : var ) -> variable )
Confirm whether the system can correctly calculate the
intersection of two variable sets and return the result .
Importantly , the two variable sets must be of the same type .

4. Verify Attribute Retrieval ( get_attributes ( variable : var ) -> list


of attributes )
Confirm whether the system can correctly find all numerical
attributes of the variable . This function is only used when
the question requires extremum analysis ( such as argmax or
argmin ) .
Table 8: Prompt for KNOWLEDGE GRAPH

KNOWLEDGE GRAPH(KG)

5. Verify Maximum Value Retrieval ( argmax ( variable : var , attribute :


str ) -> variable )
Validate whether the system can find the entity with the maximum
value of the specified attribute from a set of variables . This
operation requires using ‘ get_attributes () ‘ first to obtain a
list of viable attributes .
Example : Verify if ‘ argmax ( variable , age ) ‘ correctly returns the
oldest entity in the variable set .

6. Verify Minimum Value Retrieval ( argmin ( variable : var , attribute :


str ) -> variable )
Similar to ‘ argmax ‘ , but needs to validate whether the system can
return the entity with the minimum attribute value .

7. Verify Count Function ( count ( variable : var ) -> int )


Validate whether the system can correctly return the number of
entities belonging to the variable set .

Throughout the review process , you need to ensure the correctness of


each step , thereby verifying the accuracy of the knowledge base .
Each variable is represented by an id starting from 0. Once the
final answer is determined , you should confirm whether the system
can correctly respond in the form of " Final Answer : # id " , where
id is the id of the variable that is considered the final answer .
For example , confirm if the system correctly responded with "
Final Answer : #3" when it determined #3 to be the final answer .

Your goal is to ensure the accuracy and logical consistency of the


knowledge base query process , to help improve system performance
and answer quality .
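The checker prompt above reviews traces built from seven knowledge-base tools (get_relations, get_neighbors, intersection, get_attributes, argmax, argmin, count). A toy in-memory implementation of that interface, useful for sanity-checking query traces offline, is sketched below; the triple-store representation is our own simplification of the Freebase-style KB.

# Toy triple store: (subject, relation, object). Variables are sets of entities.
TRIPLES = [
    ("Barack Obama", "people.person.profession", "Politician"),
    ("Barack Obama", "people.person.profession", "Lawyer"),
    ("Politician", "common.topic.popularity", 95),
    ("Lawyer", "common.topic.popularity", 80),
]

def get_relations(variable):
    entities = variable if isinstance(variable, set) else {variable}
    return sorted({r for s, r, _ in TRIPLES if s in entities})

def get_neighbors(variable, relation):
    entities = variable if isinstance(variable, set) else {variable}
    return {o for s, r, o in TRIPLES if s in entities and r == relation}

def intersection(variable1, variable2):
    return set(variable1) & set(variable2)

def get_attributes(variable):
    return sorted({r for s, r, o in TRIPLES if s in variable and isinstance(o, (int, float))})

def argmax(variable, attribute):
    return max(variable, key=lambda e: max(o for s, r, o in TRIPLES
                                           if s == e and r == attribute))

def argmin(variable, attribute):
    return min(variable, key=lambda e: min(o for s, r, o in TRIPLES
                                           if s == e and r == attribute))

def count(variable):
    return len(variable)

# A trace like the one the checker reviews:
v0 = get_neighbors("Barack Obama", "people.person.profession")   # {'Politician', 'Lawyer'}
print(get_relations("Barack Obama"), argmax(v0, "common.topic.popularity"), count(v0))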
Table 9: Prompt for WEBSHOP

WEBSHOP(WS)

User
As the Shopping Experience Auditor , you are charged with the task of
verifying that all actions undertaken in our web shopping
simulation adhere to the given instructions and are executed
correctly .
Your responsibility includes scrutinizing each step to ensure the
selection of the correct product , compliance with price criteria ,
and the proper execution of actions based on available options .
Should any discrepancies arise , it ’ s within your purview to
identify them and recommend appropriate corrections .
You are invited to specify any particular interactions for
verification , and you will conduct a thorough assessment to
guarantee the precision of our shopping procedure .

Table 10: Prompt for ALFWORLD

ALFWORLD(ALF)

User
As an Interactive Environment Auditor , your task is to meticulously
review the actions taken by the intelligent agent in the
household environment , ensuring they comply with the given
instructions and the range of available actions .
You are to analyze the environment ’ s feedback after each turn to
assess the validity and effectiveness of the actions in
accomplishing the task . Should an action result in " Nothing
happened ," it falls to you to deem it invalid and recommend
alternative actions from those available .
Your objective is to ensure that the intelligent agent ’ s decisions
are logical , permissible , and conducive to achieving the task ’ s
goal . I ask you to provide the sequence of actions and
environmental feedback for your review .
Table 11: Prompt for MIND2WEB

MIND2WEB(M2W)

User

As the Web Browsing Quality Supervisor , your role is to evaluate the


agent ’ s attempt at completing the following task and assess
whether the chosen action aligns with the HTML webpage and task
description :

Task Description :
[ Insert specific task description here ]

Previous actions taken by the agent :


[ List of previous actions ]

Agent ’s proposed next action :


Element : [ Description of the selected element ]
Action : [ Type of action ]
Value : [ Value for input or selection , if the action is not a click ]

Please examine the HTML element and the task description . Determine
if the agent ’ s proposed action accurately fulfills the task
requirements .
Provide feedback on the appropriateness of the action . If it
deviates from what is required , kindly specify the correct action
that should be taken .
Table 12: Prompt for OPERATING SYSTEM

OPERATING SYSTEM(OS)

User
As the System Operation Audit Expert, your role is to evaluate the assistant's actions in the context of a simulated interaction with a Linux (Ubuntu) operating system.
The assistant will perform tasks based on your requests and provide explanations for their actions. For each round, review the assistant's thought process and the action taken, and determine if the task has been correctly completed.
The assistant can take one of three actions:
1. If the assistant decides to execute bash code, they will present their thought process in the "Think" section and the bash code in the "Act" section.
2. If the assistant believes the task is finished, they will indicate completion in the "Think" section and provide a justification for why the task is considered complete.
3. If the assistant has an answer to a question, they will explain how they arrived at the answer in the "Think" section and provide the answer in the "Act" section.
Your task is to assess the assistant's performance based on the following criteria:
Accuracy: Is the action taken appropriate for the task described?
Completeness: Has the assistant provided all necessary information and performed all required steps?
Justification: Is the thought process logical and does it support the action taken?
Please provide feedback on the assistant's actions, including any errors or omissions, and suggest corrections if necessary. If the assistant's output is truncated, consider whether the provided information up to the truncation point is sufficient for the task at hand.

Begin your evaluation with the following prompt:

Please evaluate the assistant's last action and thought process. Was the task performed correctly? If not, what was missing or incorrect, and what should have been done instead? Provide specific feedback based on the assistant's "Think" and "Act" sections.
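Because the assistant's turns follow the fixed "Think"/"Act" layout described above, an auditor can separate the reasoning from the action before judging either. The helper below is a hedged sketch of that split (regex-based, written here for illustration; it is not part of the operating-system benchmark or the CMAT framework).

```python
import re

def split_think_act(turn: str):
    """Split an assistant turn into its "Think" and "Act" parts, if present."""
    think = re.search(r"Think\s*:\s*(.*?)(?=\n\s*Act\s*:|\Z)", turn, re.S)
    act = re.search(r"Act\s*:\s*(.*)", turn, re.S)
    return (think.group(1).strip() if think else "",
            act.group(1).strip() if act else "")

think, act = split_think_act("Think: count files by script.\nAct: bash\n<BASH>\nls -1 /etc | wc -l")
print(think)                # -> count files by script.
print(act.splitlines()[0])  # -> bash
```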
Table 13: Example for OPERATING SYSTEM

OPERATING SYSTEM(OS)

<BASH>
ls /etc
User: The output of the OS: cpi cron.hourly fuse.conf iproute2 lvm networkd-dispatcher protocols selinux tmpfiles.d [truncated because the output is too long]

Agent: Think: The output has been truncated because it is too long, so I need to count files by script directly.
Act: bash
<BASH>
ls -1 /etc | wc -l

User: The output of the OS: 220

Agent: Think: Now I get the answer, it is 220.
Act: answer(220)

User: Now, I will start a new problem in a new OS. My problem is: In this problem, you need to find the total number of lines containing the word "Linux" in all the ".txt" files inside a given directory and its subdirectories.
Here's the problem detail: A new directory named "Linux_files" will be created in your home directory. The "Linux_files" directory will contain multiple ".txt" files and subdirectories. Some of the subdirectories inside "Linux_files" will also contain ".txt" files. Some of the ".txt" files will contain the word "Linux" in one or more lines.
Your task is to find the total number of lines containing the word "Linux" in all the ".txt" files inside the "Linux_files" directory and its subdirectories. Submit your answer as an integer.

Agent: Think: To find the total number of lines containing the word "Linux" in all the ".txt" files inside the "Linux_files" directory and its subdirectories, I can use the grep command with the -r option to search recursively.
Act: bash
<BASH>
grep -r "Linux" ~/Linux_files/*.txt | wc -l

User: The output of the OS: 4

Agent: Think: The output shows that there are 4 lines containing the word "Linux" in all the ".txt" files inside the "Linux_files" directory and its subdirectories.
Act: answer(4)
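One caveat about the final command above: with grep -r, the glob ~/Linux_files/*.txt expands only to top-level entries whose names end in .txt, so .txt files nested in subdirectories are searched only if a subdirectory name itself happens to match the glob. The sketch below shows an equivalent count that walks subdirectories explicitly; it is our illustration of the task, not part of the benchmark, and the function name and usage are assumptions.

```python
from pathlib import Path

def count_lines_with_word(root: str, word: str = "Linux") -> int:
    """Count lines containing `word` across all .txt files under `root`, recursively.

    Illustrative sketch of the task above; the example agent solves it with grep instead.
    """
    total = 0
    for path in Path(root).expanduser().rglob("*.txt"):
        with path.open(errors="ignore") as f:
            total += sum(1 for line in f if word in line)
    return total

# e.g. count_lines_with_word("~/Linux_files")
```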
Table 14: Example for KNOWLEDGE GRAPH

KNOWLEDGE GRAPH(KG)

Agent: Thought: Now I need to find the city where Busch Stadium is located. To achieve this, I will query the KB to find relations connected to the variable #1.
Action: get_relations(#1)

User: Observation: [List of relations including location.location.containedby, location.location.adjoin_s, location.location.contains, and many others]

Agent: Thought: From the above relations, I might use location.location.containedby to find the location of Busch Stadium.
Action: get_neighbors(Busch Stadium, location.location.containedby)

User: Observation: variable #2, which are instances of location.location

Agent: Thought: Now I need to find the city where Busch Stadium is located. To achieve this, I will query the KB to find relations connected to the variable #2.
Action: get_relations(#2)

User: Observation: [List of relations similar to the first list]

Agent: Thought: From the above relations, I might use location.location.containedby to find the location of Busch Stadium.
Action: get_neighbors(Busch Stadium, location.location.containedby)

User: Observation: variable #3, which are instances of location.location

Agent: Thought: The Busch Stadium is located in St. Louis, Missouri, USA. Now I need to find the city where the stadium is located. To achieve this, I will query the KB to find relations connected to the variable #3.
Action: get_relations(#3)
Table 15: Example for KNOWLEDGE GRAPH

KNOWLEDGE GRAPH(KG)

User: Observation: [List of relations similar to the previous lists]

Agent: Thought: From the above relations, I might use location.location.containedby to find the location of Busch Stadium.
Action: get_neighbors(Busch Stadium, location.location.containedby)

User: Observation: variable #4, which are instances of location.location

Agent: Thought: The Busch Stadium is located in St. Louis, Missouri, USA. Now I need to find the city where the stadium is located. To achieve this, I will query the KB to find relations connected to the variable #4.
Action: get_relations(#4)

User: Observation: [List of relations similar to the previous lists]

Agent: Thought: From the above relations, I might use location.location.containedby to find the location of Busch Stadium.
Action: get_neighbors(Busch Stadium, location.location.containedby)

User: Observation: variable #5, which are instances of location.location

Agent: Thought: The Busch Stadium is located in St. Louis, Missouri, USA. Now I need to find the city where the stadium is located. To achieve this, I will query the KB to find relations connected to the variable #5.
Action: get_relations(#5)

User: Observation: [List of relations similar to the previous lists]
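The trajectory in Tables 14 and 15 keeps alternating the same get_relations / get_neighbors queries while the variable id grows, without ever extracting a city. A checker can surface this kind of stagnation with a simple repetition heuristic; the sketch below is our illustration only and is not part of the knowledge-graph task or the CMAT prompts.

```python
import re

def is_looping(actions, window=2, repeats=3):
    """Return True if the last `window` actions (ignoring variable ids) repeat `repeats` times.

    `actions` is assumed to be a list of strings such as "get_relations(#3)" or
    "get_neighbors(Busch Stadium, location.location.containedby)".
    """
    normalized = [re.sub(r"#\d+", "#x", a) for a in actions]
    if len(normalized) < window * repeats:
        return False
    pattern = normalized[-window:]
    recent = normalized[-window * repeats:]
    return all(recent[i:i + window] == pattern for i in range(0, window * repeats, window))

history = ["get_neighbors(Busch Stadium, location.location.containedby)", "get_relations(#3)",
           "get_neighbors(Busch Stadium, location.location.containedby)", "get_relations(#4)",
           "get_neighbors(Busch Stadium, location.location.containedby)", "get_relations(#5)"]
print(is_looping(history))  # -> True
```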


Table 16: Example for ALFWORLD

ALFWORLD(ALF)

Agent Thought: "The task is to examine the mug with the desklamp. I need to find where the mug is first and then find the desklamp to light it up. I'll start my search on desk 1."

Agent Action: "go to desk 1"
Environment Feedback: "On the desk 1, you see a bowl 1, a desklamp 1, a laptop 1, and a pen 1."

Agent Action: "examine bowl 1 with desklamp 1"
Environment Feedback: "You pick up the bowl 1 from the desk 1."

Agent Action: "put bowl 1 in/on desk 1"
Environment Feedback: "You put the bowl 1 in/on the desk 1."

Agent Thought: "The task is to examine the mug with the desklamp. I need to find where the mug is first and then find the desklamp to light it up. I'll start my search on desk 1."
Agent Action: "examine desk 1"
Environment Feedback: "On the desk 1, you see a bowl 1, a desklamp 1, a laptop 1, and a pen 1."

Agent Action: "take mug 1 from desk 1"
Environment Feedback: "You pick up the bowl 1 from the desk 1."

Agent Action: "use desklamp 1"
Environment Feedback: "You turn on the desklamp 1."

Agent Action: "examine mug 1 with desklamp 1"
Environment Feedback: "You turn on the desklamp 1."
Table 17: Example for MIND2WEB

MIND2WEB(M2W)

User: Based on the HTML webpage above, try to complete the following task:
Task: Find tuition & fees for full term graduate students in the college of engineering for the academic year 2022-2023
Previous actions:
[link] Costs & Financial Aid -> CLICK
[combobox] Academic Year -> SELECT: 2022-2023
[combobox] College/School -> SELECT: College of Engineering
[combobox] Full or Half Term -> SELECT: Full Term
[combobox] Level of Study -> SELECT: Graduate
What should be the next action? Please select from the following choices (If the correct action is not in the page above, please select A. 'None of the above'):
A. None of the above
B. <a id=0> Graduation </a>
C. <select id=1 academic_year> <option all> Select </option> <option 175 true>
D. <select id=2 college_school> <option all> Select </option> <option 17> College
E. <select id=3 level_of_study> <option all> Select </option> <option 38 true>
F. <input id=4 submit apply/>
Agent: Thought: The search has already been set to College of Engineering and Graduate level. Next, I should select Full Term.
Answer: E.
Action: SELECT
User: Based on the HTML webpage above, try to complete the following task:
Task: Find tuition & fees for full term graduate students in the college of engineering for the academic year 2022-2023
Previous actions:
[link] Costs & Financial Aid -> CLICK
[combobox] Academic Year -> SELECT: 2022-2023
[combobox] College/School -> SELECT: College of Engineering
[combobox] Full or Half Term -> SELECT: Full Term
[combobox] Level of Study -> SELECT: Graduate
What should be the next action? Please select from the following choices (If the correct action is not in the page above, please select A. 'None of the above'):
A. None of the above
B. <input id=0 text search/>
C. <a id=1> Tuition </a>
D. <div id=2> Tuition & Fees </div>
E. <span id=3> Tuition & Fees </span>
F. <a id=4> Class Fees </a>
Table 18: Example for MIND2WEB

MIND2WEB(M2W)

Agent: Thought: The search has already been set to Tuition. Next, I should find the tuition & fees for full term graduate students in the college of engineering for the academic year 2022-2023.
Answer: D.
Action: CLICK on the link with the text "Tuition & Fees"
User: Based on the HTML webpage above, try to complete the following task:
Task: Find tuition & fees for full term graduate students in the college of engineering for the academic year 2022-2023
Previous actions:
[link] Costs & Financial Aid -> CLICK
[combobox] Academic Year -> SELECT: 2022-2023
[combobox] College/School -> SELECT: College of Engineering
[combobox] Full or Half Term -> SELECT: Full Term
[combobox] Level of Study -> SELECT: Graduate
What should be the next action? Please select from the following choices (If the correct action is not in the page above, please select A. 'None of the above'):
A. None of the above
B. <div id=0> Tuition & Fees </div>
C. <select id=1 level_of_study> <option all> Select </option> <option 38 true>
Agent: Thought: The search has already been set to graduate students in the college of engineering for the academic year 2022-2023. Next, I should select the level of study.
Answer: C.
Action: SELECT: Graduate
Table 19: Example for WEBSHOP

WEBSHOP(WS)

User: You are web shopping. I will give you instructions about what to do. You have to follow the instructions. Every round I will give you an observation and a list of available actions, you have to respond with an action based on the state and instruction. You can use the search action if the search is available. You can click one of the buttons in clickables. An action should be of the following structure: search[keywords] or click[value]. If the action is not valid, perform nothing. Keywords in search are up to you, but the value in click must be a value in the list of available actions. Remember that your keywords in search should be carefully designed. Your response should use the following format:

Thought:
I think ...

Action:
click[something]

Agent: Ok.

User: Observation:
"WebShop [SEP] Instruction: [SEP] I'm looking for rubber stole shoes for light wearing it was brown in color, and price lower than 70.00 dollars [SEP] Search"

Available Actions:
{"has_search_bar": true, "clickables": ["search"]}

Agent: Thought:
I think I should use the search bar to look for the product I need.

Action:
search[rubber stole shoes brown price < 70.00]

User: Observation:
Instruction: [SEP] I'm looking for rubber stole shoes for light wearing it was brown in color, and price lower than 70.00 dollars [SEP] Back to Search [SEP] Page 1 (Total results: 50) [SEP] Next > [SEP] B08XHSG67V [SEP] Crosley Furniture Madeleine 53.25" Farmhouse Metal Kitchen Island in Brown/Black [SEP] 331.49 [SEP] B07CZ5YKWW [SEP] Roper Tumbled Womens Brown Leather Petty Slip-On Shoes [...]
Table 20: Example for WEBSHOP

WEBSHOP(WS)

[...]
Available Actions:
{"has_search_bar": false, "clickables": ["back to search", "next >", "b08xhsg67v", "b07cz5ykww", "b000jiatqs", "b01ld5a59g", "b085wr7626", "b07s7qxncw", "b01lykp60o", "b004xj62re", "b07fy9mfhx", "b09cfqjt7h"]}

Agent: Thought:
I think I should click on the product I need, which is B08XHSG67V.

Action:
click[B08XHSG67V]

User: Observation:
Instruction: [SEP] I'm looking for rubber stole shoes for light wearing it was brown in color, and price lower than 70.00 dollars [SEP] Back to Search [SEP] < Prev [SEP] Crosley Furniture Madeleine 53.25" Farmhouse Metal Kitchen Island in Brown/Black [SEP] Price: $331.49 [SEP] Rating: N.A. [SEP] Description [SEP] Features [SEP] Reviews [SEP] Buy Now

Available Actions:
{"has_search_bar": false, "clickables": ["back to search", "< prev", "description", "features", "reviews", "buy now"]}

Agent: Thought:
I think I should click on the 'Buy Now' button to purchase the product.

Action:
click[Buy Now]
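The instruction at the start of this episode fixes the action grammar to search[keywords] and click[value], with click values restricted to the listed clickables. A checker can enforce that grammar mechanically before judging the semantics of a step; the sketch below is a hypothetical illustration of such a check, not code shipped with WebShop or CMAT.

```python
import re

def is_valid_action(action: str, available: dict) -> bool:
    """Check a WebShop-style action string against the observed available actions.

    `available` is assumed to look like {"has_search_bar": bool, "clickables": [str, ...]},
    matching the observations shown above.
    """
    action = action.strip()
    if re.fullmatch(r"search\[.+\]", action):
        return available.get("has_search_bar", False)
    click = re.fullmatch(r"click\[(.+)\]", action)
    if click:
        clickables = [c.lower() for c in available.get("clickables", [])]
        return click.group(1).strip().lower() in clickables
    return False

avail = {"has_search_bar": False, "clickables": ["back to search", "buy now", "b08xhsg67v"]}
print(is_valid_action("click[Buy Now]", avail))       # -> True
print(is_valid_action("search[brown shoes]", avail))  # -> False (no search bar here)
```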
Table 21: Example for DATABASE

DATABASE(DB)

User: Never forget you are a Database Operator and I am an Operation Supervisor. Never flip roles! Never instruct me! We share a common interest in collaborating to successfully complete a task. You must help me to complete the task. Here is the task: I will ask you a question, then you should help me operate a MySQL database with SQL to answer the question. Never forget our task! I must instruct you based on your expertise and my needs to complete the task. You have to explain the problem and your solution to me and write down your thoughts. After thinking and explaining thoroughly, every round you can choose to operate or to answer. Your operation should be like this:

Action: Operation
```sql
SELECT * FROM table WHERE condition;
```

You MUST put SQL in markdown format without any other comments. Your SQL should be in one line. Every time you can only execute one SQL statement. I will only execute the statement in the first SQL code block. Every time you write a SQL, I will execute it for you and give you the output. If you are done operating, and you want to commit your final answer, then write down:

Action: Answer
Final Answer: ["ANSWER1", "ANSWER2", ...]

DO NOT write this pattern unless you are sure about your answer. I expect an accurate and correct answer. Please note: Your answer should be accurate. Your answer must be exactly the same as the correct answer. If the question is about modifying the database, then after done operation, your answer field can be anything. If your response cannot match any pattern I mentioned earlier, you will be judged as FAIL immediately. Your input will be raw MySQL response, you have to deal with it by yourself.
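Because the prompt above promises to run only the statement in the first SQL code block, a supervisor-side harness needs to pull that block out of the operator's reply. The helper below is a minimal, assumed implementation of that extraction (ours, not from the paper or the benchmark), using the triple-backtick sql fence described in the prompt.

```python
import re
from typing import Optional

FENCE = "`" * 3  # the triple-backtick delimiter used around the sql block

def first_sql_block(response: str) -> Optional[str]:
    """Return the statement inside the first fenced sql block, or None if absent."""
    pattern = re.escape(FENCE) + r"\s*sql\s*(.*?)" + re.escape(FENCE)
    match = re.search(pattern, response, re.S | re.I)
    return match.group(1).strip() if match else None

reply = ("Action: Operation\n" + FENCE + "sql\n"
         "SELECT `Event` FROM `MMA Fight Record` WHERE `Opponent` = 'Masato Shiozawa'\n"
         + FENCE)
print(first_sql_block(reply))  # -> SELECT `Event` FROM `MMA Fight Record` ...
```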
Table 22: Example for DATABASE

DATABASE(DB)

Agent: Ok

User: What event did he fight masato shiozawa?
The name of this table is MMA Fight Record, and the headers of this table are Res, Record, Opponent, Method, Event, Round.

Agent: To find the event that Masato Shiozawa and I fought, I can use a SELECT statement to retrieve the `Event` column from the `MMA Fight Record` table. To specify the condition that Masato Shiozawa was the opponent, I can use a WHERE clause with the condition `Opponent = 'Masato Shiozawa'`.
Action: Operation
```sql
SELECT `Event` FROM `MMA Fight Record` WHERE `Opponent` = 'Masato Shiozawa'
```

User: [('Shooto 2003 - 5/4 in Korakuen Hall',)]

Agent: The SQL query executed successfully and returned the expected answer. The event that Masato Shiozawa and I fought was 'Shooto 2003 - 5/4 in Korakuen Hall'. I can now commit my answer.
Action: Answer
Final Answer: ["Shooto 2003 - 5/4 in Korakuen Hall"]
Table 23: prompt - High-quality

Q1: How to install the required Python libraries for a Q&A game?

A1 (Assistant): Use the following command to install the 'transformers' library:
```python
import subprocess
requirements = ['transformers']
for package in requirements:
    subprocess.run(['pip', 'install', '-U', package])
```
A1 (Checker): The code is correct, but consider using a requirements.txt file for better package management.
Q2: How to import the 'transformers' library for a question-answering task?
A2 (Assistant): Import the pipeline function to create a Q&A task pipeline:
```python
from transformers import pipeline
```
A2 (Checker): Correct. This is the standard method for importing and using the 'transformers' pipeline.
Q3: How to define a function to play a memory game that checks if the user's answer is correct?
A3 (Assistant): Define the 'play_memory_game' function without checking input types:
```python
def play_memory_game(context, question, user_answer):
    question_answerer = pipeline('question-answering')
    result = question_answerer(question=question, context=context)
    return user_answer == result['answer']
```
A3 (Checker): You should validate input types to ensure they are strings and handle exceptions. Use a specific model for better accuracy:
```python
def play_memory_game(context, question, user_answer):
    if not all(isinstance(arg, str) for arg in [context, question, user_answer]):
        raise ValueError('All arguments must be of type str')
    question_answerer = pipeline('question-answering', model='distilbert-base-uncased-distilled-squad')
    result = question_answerer(question=question, context=context)
    return user_answer.lower() == result['answer'].lower()
```
Q4: How to call the 'play_memory_game' function and check results?
A4 (Assistant): Call the function with incorrect syntax:
play_memory_game('The capital of France is Paris', 'What is the capital of France?', 'Paris')
A4 (Checker): The syntax is almost correct, but you should capture and print the function's return value to check the result:
```python
result = play_memory_game('The capital of France is Paris', 'What is the capital of France?', 'Paris')
print(result)
```
Table 24: prompt - Low-quality

Q1: How to install the required Python libraries for a Q&A game?

A1 (Assistant): Use the command 'pip install transformers'.
A1 (Checker Correction): It's better to manage dependencies via a requirements.txt file. Use 'pip install -r requirements.txt' where 'transformers' is listed.
Q2: How to import the 'transformers' library for a question-answering task?
A2 (Assistant): Use 'from transformers import pipeline'.
A2 (Checker Correction): Correct. Ensure 'transformers' is installed beforehand.
Q3: How to define a function to play a memory game that checks if the user's answer is correct?
A3 (Assistant): Provides a basic function without input type checking.
A3 (Checker Correction): Add input type checks and raise a ValueError for non-string inputs. Use a specific model for accuracy: 'distilbert-base-uncased-distilled-squad'.
Q4: How to test the 'play_memory_game' function's correctness?
A4 (Assistant): Provides a test function without exception handling.
A4 (Checker Correction): Modify the test function to catch exceptions and provide detailed output for each test case for easier debugging.
Table 25: without prompts

Q1: How to install the required Python libraries?

A1: Use the following command to install the 'transformers' library:
import subprocess
requirements = ['transformers']
for package in requirements:
    subprocess.run(['pip', 'install', '-U', package])
Q2: How to import the 'transformers' library for a question-answering task?
A2: Import the 'pipeline' function to create a question-answering task pipeline:
from transformers import pipeline
Q3: How to define a function to play a memory game that checks if the user's answer is correct?
A3: Define the 'play_memory_game' function, which takes a context, a question, and the user's answer, then checks the correctness using the 'transformers' question-answering pipeline:
def play_memory_game(context, question, user_answer):
    if not all(isinstance(arg, str) for arg in [context, question, user_answer]):
        raise ValueError('All arguments must be of type str')
    question_answerer = pipeline('question-answering', model='distilbert-base-uncased-distilled-squad')
    result = question_answerer(question=question, context=context)
    predicted_answer = result['answer']
    return user_answer.lower() == predicted_answer.lower()
Q4: How to test the 'play_memory_game' function?
A4: The 'test_play_memory_game' function includes three test cases: normal execution, handling non-string input, and incorrect user answer:
def test_play_memory_game():
    # Normal execution test
    assert play_memory_game('The capital of France is Paris.', 'What is the capital of France?', 'Paris'), "Incorrect answer."
    # Non-string input test
    try:
        play_memory_game(123, 'What is the answer?', 'test')
    except ValueError:
        pass  # Expected failure for non-string input
    # Incorrect answer test
    assert not play_memory_game('The Earth revolves around the Sun.', 'What does the Moon revolve around?', 'Sun'), "Incorrect answer should fail."