Autonomous AI Agents
Abstract—Large language models and autonomous AI agents have evolved rapidly, resulting in a diverse array of evaluation benchmarks, frameworks, and collaboration protocols. However, the landscape remains fragmented and lacks a unified taxonomy or comprehensive survey. Therefore, we present a side-by-side comparison of benchmarks developed between 2019 and 2025 that evaluate these models and agents across multiple domains. In addition, we propose a taxonomy of approximately 60 benchmarks that cover general and academic knowledge reasoning, mathematical problem-solving, code generation and software engineering, factual grounding and retrieval, domain-specific evaluations, multimodal and embodied tasks, task orchestration, and interactive assessments. Furthermore, we review AI-agent frameworks introduced between 2023 and 2025 that integrate large language models with modular toolkits to enable autonomous decision-making and multi-step reasoning. Moreover, we present real-world applications of autonomous AI agents in materials science, biomedical research, academic ideation, software engineering, synthetic data generation, chemical reasoning, mathematical problem-solving, geographic information systems, multimedia, healthcare, and finance. We then survey key agent-to-agent collaboration protocols, namely the Agent Communication Protocol (ACP), the Model Context Protocol (MCP), and the Agent-to-Agent Protocol (A2A). Finally, we discuss recommendations for future research, focusing on advanced reasoning strategies, failure modes in multi-agent LLM systems, automated scientific discovery, dynamic tool integration via reinforcement learning, integrated search capabilities, and security vulnerabilities in agent protocols.

Index Terms—Large Language Models, Autonomous AI Agents, Agentic AI, Reasoning, Benchmarks.

I. INTRODUCTION

Large Language Models (LLMs) such as OpenAI's GPT-4 [1], Qwen2.5-Omni [2], DeepSeek-R1 [3], and Meta's LLaMA [4] have transformed AI by enabling human-like text generation and advanced natural language processing, spurring innovation in conversational agents, automated content creation, and real-time translation [5]. Recent enhancements have extended their utility to multimodal tasks, including text-to-image and text-to-video generation, broadening the scope of generative AI applications [6]. However, their dependence on static pre-training data can lead to outdated outputs and hallucinated responses [7], [8], a limitation that Retrieval-Augmented Generation (RAG) addresses by incorporating real-time data from knowledge bases, APIs, or the web [9], [10]. Building on this, the evolution of intelligent agents employing reflection, planning, and multi-agent collaboration has given rise to Agentic RAG systems, which dynamically orchestrate information retrieval and iterative refinement to manage complex workflows effectively [11], [12].

Recent advances in large language models have paved the way for highly autonomous AI systems that can independently handle complex research tasks. These systems, often referred to as agentic AI, can generate hypotheses, conduct literature reviews, design experiments, analyze data, accelerate scientific discovery, and reduce research costs [13], [14], [15], [16]. Several frameworks, such as LitSearch, ResearchArena, and Agent Laboratory, have been developed to automate various research tasks, including citation management and academic survey generation [17], [18], [19]. However, challenges persist, especially in executing domain-specific literature reviews and ensuring the reproducibility and reliability of automated processes [20], [21]. Parallel to these developments in research automation, large language model-based agents have also begun to transform the medical field [22]. These agents are increasingly used for diagnostic support, patient communication, and medical education by integrating clinical guidelines, medical knowledge bases, and healthcare systems. Despite their promise, these applications face significant hurdles, including concerns over reliability, reproducibility, ethical governance, and safety [23], [24], [25]. Addressing these issues is crucial for ensuring that LLM-based agents can be effectively and responsibly incorporated into clinical practice, underscoring the need for comprehensive evaluation frameworks that can reliably measure their performance across various healthcare tasks [26], [27], [28].

LLM-based agents are emerging as a promising frontier in AI, combining reasoning and action to interact with complex digital environments [29], [30]. Therefore, various approaches have been explored to enhance LLM-based agents, from combining reasoning and acting using techniques like ReAct [31] and Monte Carlo Tree Search [32] to synthesizing high-quality data with methods like Learn-by-Interact [33], which sidestep assumptions such as state reversals. Other strategies involve training on human-labeled or GPT-4-distilled data with systems like AgentGen [34] and AgentTuning [35] to generate trajectory data. At the same time, reinforcement learning methods utilize offline algorithms and iterative refinement through reward models and feedback to enhance efficiency and performance in realistic environments [36], [37].
LLM-based multi-agents harness the collective intelligence of multiple specialized agents, enabling advanced capabilities over single-agent systems by simulating complex real-world environments through collaborative planning, discussion, and decision-making. This approach leverages the communicative strengths and domain-specific expertise of LLMs, allowing distinct agents to interact effectively, much like human teams tackling problem-solving tasks [38], [39]. Recent research highlights promising applications across various fields, including software development [40], [41], multi-robot systems [42], [43], society simulation [44], policy simulation [45], and game simulation [46].

The main contributions of this study are:

• We present a comparative table of benchmarks developed between 2019 and 2025 that rigorously evaluate large language models and autonomous AI agents across multiple domains.
• We propose a taxonomy of approximately 60 LLM and AI-agent benchmarks, including general and academic knowledge reasoning, mathematical problem solving, code generation and software engineering, factual grounding and retrieval, domain-specific evaluations, multimodal and embodied tasks, task orchestration, and interactive and agentic assessments.
• We present prominent AI-agent frameworks from 2023 to 2025 that integrate large language models with modular toolkits, enabling autonomous decision-making and multi-step reasoning.
• We provide applications of autonomous AI agents in various fields, including materials science and biomedical research, academic ideation and software engineering, synthetic data generation and chemical reasoning, mathematical problem-solving and geographic information systems, as well as multimedia, healthcare, and finance.
• We survey agent-to-agent collaboration protocols, namely the Agent Communication Protocol (ACP), the Model Context Protocol (MCP), and the Agent-to-Agent Protocol (A2A).
• We outline recommendations for future research on autonomous AI agents, specifically advanced reasoning strategies, failure modes in multi-agent large language model (LLM) systems, automated scientific discovery, dynamic tool integration via reinforcement learning, integrated search capabilities, and security vulnerabilities in agent protocols.

Fig. 1 illustrates the structure of this survey. Section II presents the related works. Section III provides a side-by-side tabular comparison of state-of-the-art LLM and Agentic AI benchmarks. Section IV reviews AI agent frameworks, AI agent applications, AI agent protocols, and training datasets across various domains. Section V highlights several critical research directions. Finally, Section VI concludes the paper.

II. RELATED WORKS

The growing field of autonomous AI agents powered by large language models has inspired a wide range of research efforts across multiple domains. In this section, we review the most relevant studies that investigate the integration of LLM-based agents into software engineering, propose agent architectures and evaluation frameworks, explore the development of multi-agent systems, and examine domain-specific applications, including healthcare, game-theoretic scenarios, GUI interactions, personal assistance, scientific discovery, and chemistry.

A. LLM-based Agents in Software Engineering

Wang et al. [47] present a survey that bridges Large Language Model (LLM)-based agent technologies with software engineering (SE). It highlights how LLMs have achieved significant success in various domains and have been integrated into SE tasks, often under the agent paradigm, whether explicitly or implicitly. The study presents a structured framework for LLM-based agents in SE, comprising three primary modules: perception, memory, and action. In a complementary study, Jin et al. [48] investigate the use of large language models (LLMs) and LLM-based agents in software engineering, distinguishing between the traditional capabilities of LLMs and the enhanced functionalities offered by autonomous agents. Their work highlights the significant success of LLMs in tasks such as code generation and vulnerability detection, while also addressing their limitations, specifically the issues of autonomy and self-improvement that LLM-based agents aim to overcome. The paper provides an extensive review of current practices across six key domains: requirement engineering, code generation, autonomous decision-making, software design, test generation, and software maintenance.

B. Agent Architectures and Evaluation Frameworks

Singh et al. [49] delve into Agentic Retrieval-Augmented Generation (Agentic RAG), a sophisticated evolution of traditional Retrieval-Augmented Generation systems that enhances the capabilities of large language models (LLMs). While LLMs have transformed AI through human-like text generation and language understanding, their dependence on static training data often results in outdated or imprecise responses. The paper addresses these limitations by embedding autonomous agents within the RAG framework, enabling dynamic, real-time data retrieval and adaptive workflows. It details how agentic design patterns such as reflection, planning, tool utilization, and multi-agent collaboration equip these systems to manage complex tasks and support multi-step reasoning.
[Fig. 1: Structure of the survey, organized around six guiding questions: the impact of recent LLM and agentic AI advances and the contributions of this study; related surveys on LLM-based agents and autonomous AI systems; key LLM benchmarks developed between 2019 and 2025 (e.g., MMLU, ComplexFuncBench, Humanity's Last Exam, FACTS Grounding, ProcessBench, OmniDocBench); key AI agent frameworks and applications from 2024 and 2025; key challenges and open problems (e.g., AI agent reasoning and why multi-agent LLM systems fail); and conclusions and future directions.]
The survey offers a comprehensive taxonomy of Agentic RAG architectures, highlights key applications across various sectors, including healthcare, finance, and education, and outlines practical implementation strategies.

Complementing this architectural perspective, Yehudai et al. [50] mark a significant milestone in artificial intelligence by surveying evaluation methodologies for agents powered by large language models (LLMs). The survey thoroughly reviews the capabilities of these agents, focusing on core functions such as planning, tool utilization, self-reflection, and memory, while assessing specialized applications ranging from web interactions to software engineering and conversational tasks. The authors uncover a clear trend toward developing more rigorous, dynamically updated evaluation frameworks by examining both targeted benchmarks for domain-specific applications and those designed for more generalist agents. Moreover, the paper critically highlights existing deficiencies in the field, notably the need for metrics that more effectively capture cost efficiency, safety, and robustness. In doing so, it maps the current landscape of agent evaluation and sets forth compelling directions for future inquiry, underscoring the importance of scalable and fine-grained evaluation techniques in the rapidly evolving AI domain.

Similarly, Chen et al. [51] focus on Role-Playing Agents (RPAs), a growing class of LLM-based agents that mimic human behavior across various tasks. Recognizing the inherent challenges in evaluating such diverse systems, the authors systematically reviewed 1,676 papers published between January 2021 and December 2024. Their extensive analysis identifies six key agent attributes, seven task attributes, and seven evaluation metrics that are prevalent in the current literature. Based on these insights, the paper proposes an evidence-based, actionable, and generalizable evaluation guideline designed to standardize the assessment of RPAs.
C. Multi-Agent Systems

Yan et al. [52] provide a comprehensive survey on integrating LLMs into multi-agent systems (MAS). Their work emphasizes the communication-centric aspects that enable agents to engage in both cooperative and competitive interactions, thereby tackling tasks that are unmanageable for individual agents. The paper examines system-level features, internal communication mechanisms, and challenges, including scalability, security, and multimodal integration. In a related study, Guo et al. [38] provide an extensive overview of large language model (LLM)-based multi-agent systems, building on the success of LLMs in autonomous planning and reasoning. The authors detail how the evolution from single-agent decision-making to collaborative multi-agent frameworks has enabled significant advances in complex problem-solving and world simulation. Key aspects of these systems are examined, including the domains and environments they simulate, the profiling and communication strategies employed by individual agents, and the mechanisms that underpin the enhancement of their collective capacities.

D. Domain-Specific Applications

1) Healthcare: Wang et al. [28] explore the transformative impact of LLM-based agents on healthcare, presenting a detailed review of their architectures, applications, and inherent challenges. The survey dissects the core components of medical agent systems, such as system profiles, clinical planning mechanisms, and medical reasoning frameworks, while also discussing methods to enhance external capacities. Major application areas include clinical decision support, medical documentation, training simulations, and overall healthcare service optimization. The survey further evaluates the performance of these agents using established frameworks and metrics, identifying persistent challenges such as hallucination management, multimodal integration, and ethical considerations.
2) Social Agents in Game-Theoretic Scenarios: Feng et al. [53] provide a review of research on LLM-based social agents in game-theoretic scenarios. This area has gained prominence for assessing social intelligence in AI systems. The authors categorize the literature into three main components. First, the game framework is examined, highlighting various choice- and communication-focused scenarios. Second, the paper explores the attributes of social agents, examining their preferences, beliefs, and reasoning capabilities. Third, it discusses evaluation protocols incorporating game-agnostic and game-specific metrics to assess performance. By synthesizing current studies and outlining future research directions, the survey offers valuable insights to further the development and systematic evaluation of social agents within game-theoretic contexts.

3) GUI Agents: Zhang et al. [54] review LLM-brained GUI agents, marking a paradigm shift in human-computer interaction through the integration of multimodal LLMs. The survey traces the historical evolution of GUI automation, detailing how advancements in natural language understanding, code generation, and visual processing have enabled these agents to interpret complex graphical user interface (GUI) elements and execute multi-step tasks from conversational commands. It systematically examines the core components of these systems, including existing frameworks, data collection and utilization methods for training, and the development of specialized large-scale action models for GUI tasks.

4) Personal LLM Agents: Li et al. [55] explore the evolution of intelligent personal assistants (IPAs) by focusing on Personal LLM Agents, LLM-based agents that deeply integrate personal data and devices to provide enhanced personal assistance. The authors outline the limitations of traditional IPAs, including insufficient understanding of user intent, task planning, and tool utilization, which have hindered their practicality and scalability. In contrast, the emergence of foundation models like LLMs offers new possibilities by leveraging advanced semantic understanding and reasoning for autonomous problem-solving. The survey systematically reviews the architecture and design choices underlying Personal LLM Agents, informed by expert opinions, and examines key challenges related to intelligence, efficiency, and security. Furthermore, it comprehensively analyzes representative solutions addressing these challenges, laying the groundwork for Personal LLM Agents to become a major paradigm in next-generation end-user software.

5) Scientific Discovery: Gridach et al. [21] explore the transformative role of Agentic AI in scientific discovery, underscoring its potential to automate and enhance research processes. The paper reviews how these systems, endowed with reasoning, planning, and autonomous decision-making capabilities, are revolutionizing traditional research activities, including literature reviews, hypothesis generation, experimental design, and data analysis. It highlights recent advancements across multiple scientific domains, such as chemistry, biology, and materials science, by categorizing existing Agentic AI systems and tools. It also provides a detailed discussion of key evaluation metrics, implementation frameworks, and datasets used in the field, offering valuable insights into current practices. Moreover, the paper critically addresses significant challenges, including automating comprehensive literature reviews, ensuring system reliability, and addressing ethical concerns. It outlines future research directions, emphasizing the importance of human-AI collaboration and improved system calibration.

6) Chemistry: Ramos et al. [56] examine the transformative impact of large language models (LLMs) in chemistry, focusing on their roles in molecule design, property prediction, and synthesis optimization. The review highlights how LLMs accelerate scientific discovery through automation and also discusses the advent of LLM-based autonomous agents. These agents extend the functionality of LLMs by interfacing with their environment and performing tasks such as literature scraping, automated laboratory control, and synthesis planning. Expanding the discussion beyond chemistry, the review also considers applications across other scientific domains.

E. Comparison with Our Survey

Table I presents a consolidated view of how existing works cover key themes, benchmarks, AI agent frameworks, AI agent applications, AI agent protocols, and challenges and open problems against our survey. While prior studies typically focus on one or two aspects (e.g., Yehudai et al. [50] on evaluation benchmarks, Singh et al. [49] on RAG architectures, Yan et al. [52] on multi-agent communication, or Wang et al. [28] on domain-specific applications), none integrate the full spectrum of developments in a single, unified treatment. In contrast, our survey is the first to systematically combine state-of-the-art benchmarks, framework design, application domains, communication protocols, and a forward-looking discussion of challenges and open problems, thereby providing researchers with a comprehensive roadmap for advancing LLM-based autonomous AI agents.

III. LLM AND AGENTIC AI BENCHMARKS

This section provides a comprehensive overview of benchmarks developed between 2019 and 2025 that rigorously evaluate large language models (LLMs) across diverse and challenging domains. For instance, ENIGMAEVAL [57] assesses complex multimodal puzzle-solving by requiring the synthesis of textual and visual clues, while ComplexFuncBench [59] challenges models with multi-step function-calling tasks that mirror real-world scenarios. Humanity's Last Exam (HLE) [60] further raises the bar by presenting expert-level academic questions across a broad spectrum of subjects, thereby reflecting the growing demand for deeper reasoning and domain-specific proficiency. Additional frameworks such as FACTS Grounding [61] and ProcessBench [62] scrutinize models' capacities for generating factually accurate long-form responses and detecting errors in multi-step reasoning. Meanwhile, innovative evaluation paradigms like Agent-as-a-Judge [64], JudgeBench [65], and CyberMetric [75] provide granular insights into cybersecurity competencies and error-detection capabilities. Tables II and III present a comprehensive overview of benchmarks developed between 2024 and 2025.
Benchmark | Year | Focus | Description | Evaluation approach | Key findings
ENIGMAEVAL [57] | 2025 | Multimodal Reasoning | Contains 1,184 puzzles combining text and images; state-of-the-art systems score only ∼7% on standard puzzles and fail on the hardest ones. | Evaluates multimodal and long-context reasoning using challenging puzzles from global competitions. | Pushes models into unstructured, creative problem-solving scenarios requiring integration of visual and semantic clues.
MMLU Benchmark [58] | 2021 | Multitask Knowledge | Comprises 57 diverse tasks (from elementary math to professional law) testing zero-shot and few-shot performance. | Assesses broad world knowledge and problem-solving skills; uncovers calibration challenges and imbalances between procedural and declarative knowledge. | Designed for general multitask language understanding without task-specific fine-tuning.
ComplexFuncBench [59] | 2025 | Function Calling | Evaluates complex function calling tasks with multi-step operations and input lengths up to 128k tokens over more than 1,000 scenarios. | Introduces an automatic evaluation framework (ComplexEval) for function calling, testing reasoning over implicit parameters and constraints. | Highlights performance differences between closed models (e.g., Claude 3.5, GPT-4) and open models (e.g., Qwen 2.5, Llama 3.1).
Humanity's Last Exam (HLE) [60] | 2025 | Academic Reasoning | Features 3,000 questions spanning over 100 subjects, including multi-modal challenges. | Developed through a global collaborative effort with nearly 1,000 experts; includes both multiple-choice and short-answer formats with verifiable answers. | Exposes significant performance gaps as state-of-the-art LLMs score below 10%, serving as a critical tool for assessing academic reasoning.
FACTS Grounding [61] | 2023 | Factual Grounding | Contains 1,719 examples requiring detailed responses grounded in source documents, with inputs reaching up to 32,000 tokens. | Uses a two-phase evaluation (eligibility and factual grounding) with assessments from frontier LLM judges. | Focuses on factual accuracy and information synthesis while excluding creative or complex reasoning tasks.
ProcessBench [62] | 2024 | Error Detection | Comprises 3,400 math problem cases with step-by-step solutions and human-annotated error locations. | Evaluates models' ability to detect the earliest error in reasoning; compares process reward models with LLM-based critics. | Targets granular error detection in mathematical problem solving.
OmniDocBench [63] | 2024 | Document Understanding | A multi-source dataset spanning nine document types with 19 layout categories and 14 attribute labels. | Provides a detailed, multi-level evaluation framework for document content extraction, contrasting modular pipelines with end-to-end methods. | Addresses challenges such as fuzzy scans, watermarks, and complex layouts in document processing.
Agent-as-a-Judge [64] | 2024 | Evaluation Methodology | Evaluated on 55 code generation tasks with 365 hierarchical user requirements. | Leverages agentic systems to provide granular, intermediate feedback; achieves up to 90% alignment with human judgments. | Reduces evaluation cost and time for agentic systems, particularly in code generation tasks.
JudgeBench [65] | 2024 | Judgment Evaluation | Consists of 350 challenging response pairs across knowledge, reasoning, math, and coding domains. | Transforms existing datasets into paired comparisons with objective correctness, mitigating positional bias through double evaluation. | Aims to objectively assess LLM-based judges; fine-tuning can boost judge accuracy significantly.
SimpleQA [66] | 2023 | Factual QA | Contains 4,326 fact-seeking questions across domains; uses a strict three-tier grading system. | Focuses on evaluating factual accuracy and reveals models' overconfidence in incorrect responses through repeated testing. | Highlights current limitations in handling straightforward, factual queries.
FineTasks [67] | 2023 | Multilingual Task Selection | Evaluates 185 candidate tasks across nine languages, ultimately selecting 96 reliable tasks; supports over 550 tasks overall. | Employs metrics such as monotonicity, low noise, non-random performance, and model ordering consistency to assess task quality. | Provides a scalable, multilingual evaluation platform that highlights the impact of task formulation.
FRAMES [68] | 2024 | Retrieval & Reasoning | Consists of 824 multi-hop questions requiring integration of 2–15 Wikipedia articles. | Unifies evaluations of factual accuracy, retrieval, and reasoning; labels questions with specific reasoning types (e.g., numerical, tabular). | Baseline experiments show improvements from 40% (without retrieval) to 66% (with multi-step retrieval).
DABStep [69] | 2025 | Step-Based Reasoning | A step-based approach for multi-step reasoning tasks; the best model achieves only a 16% success rate. | Decomposes complex problem solving into discrete steps with iterative refinement and self-correction. | Highlights the significant challenges in training models for complex, iterative reasoning.
BFCL v2 [70] | 2025 | Function Calling | Contains 2,251 question-function-answer pairs covering simple to parallel function calls. | Leverages real-world, user-contributed data to address issues like data contamination and bias in function calling evaluation. | Demonstrates that models such as Claude 3.5 and GPT-4 outperform others, while some open models struggle.
SWE-Lancer [71] | 2025 | Software Engineering | Consists of over 1,400 freelance software engineering tasks, including independent and managerial tasks with real-world payout data. | Uses triple-verified tests for independent tasks and benchmarks managerial decisions against hiring manager selections. | Indicates that even advanced models (e.g., Claude 3.5 Sonnet) have low pass rates (26.2%) on implementation tasks.
CRAG Benchmark [72] | 2024 | Retrieval-Augmented Generation | Comprises 4,409 question-answer pairs across 5 domains; simulates retrieval with mock APIs. | Evaluates the generative component of RAG pipelines; shows improvement from 34% to 63% accuracy with advanced RAG methods. | Highlights performance drops for questions involving highly dynamic or less popular facts.
OCCULT Benchmark [73] | 2025 | Cybersecurity | A lightweight framework for operational evaluation of cybersecurity risks; includes three distinct OCO benchmarks. | Simulates real-world threat scenarios to assess LLM capabilities in offensive cyber operations. | Preliminary results indicate models like DeepSeek-R1 achieve over 90% in Threat Actor Competency Tests.
DIA Benchmark [74] | 2024 | Dynamic Problem Solving | Uses dynamic question templates with mutable parameters across domains (math, cryptography, cybersecurity, computer science). | Introduces innovative metrics for reliability and confidence over multiple attempts; emphasizes adaptive intelligence. | Reveals gaps in handling complex tasks and compares models' self-assessment abilities.
CyberMetric Benchmark [75] | 2024 | Cybersecurity Knowledge | A suite of multiple-choice Q&A datasets (CyberMetric-80, -500, -2000, -10000) validated over 200 human expert hours. | Generated using GPT-3.5 and RAG; benchmarks cybersecurity knowledge against human performance. | Demonstrates that larger, domain-specific models outperform smaller ones in cybersecurity understanding.
BIG-Bench Extra Hard [76] | 2025 | Challenging Reasoning | An elevated-difficulty variant of BIG-Bench Hard; average accuracy is 9.8% for general models and 44.8% for reasoning-specialized models. | Replaces each BBH task with a more challenging variant to probe reasoning capabilities robustly. | Emphasizes substantial room for improvement in general-purpose reasoning skills.
MultiAgentBench [77] | 2025 | Multi-Agent | Encompasses six domains: research proposal writing, Minecraft structure building, database error analysis, collaborative coding, competitive Werewolf gameplay, and resource bargaining. | Investigates various coordination protocols (star, chain, tree, graph); peer-to-peer communication plus cognitive planning yields a 3% improvement in milestone achievement. Graph-based protocols outperform others in research tasks. | GPT-4o-mini achieves the highest average task score; highlights synergy vs. complexity trade-offs in multi-agent LLM settings.
GAIA [78] | 2024 | General AI Assistants | 466 curated questions with reference answers; humans achieve 92% accuracy while GPT-4 with plugins only reaches 15%. | Emphasizes everyday reasoning tasks involving multi-modality, web browsing, and tool use. Targets AI robustness over specialized skills. | Highlights the large performance gap between humans and SOTA models; aims to measure truly general-purpose AI capabilities.
CASTLE [79] | 2025 | Vulnerability Detection in Source Code | 250 hand-crafted micro-benchmark programs covering 25 common CWEs; introduces the novel CASTLE Score metric. | Integrates evaluations across 13 static analysis tools, 10 LLMs, and two formal verification tools; provides a unified framework for comparing diverse methods. | Formal verification tools (e.g., ESBMC) minimize false positives but miss vulnerabilities beyond model checking; static analyzers generate excessive false positives; LLMs perform well on small code snippets, but accuracy declines and hallucinations increase as code size grows.
SPIN-Bench [80] | 2025 | Strategic Planning, Interaction, and Negotiation | Evaluates reasoning and strategic behavior in diverse social settings by combining classical PDDL tasks, competitive board games, cooperative card games, and multi-agent negotiation scenarios. | Systematically varies action spaces, state complexity, and the number of interacting agents to simulate realistic social interactions, providing both a benchmark and an arena for multi-agent evaluation. | Reveals that while LLMs perform basic fact retrieval and short-range planning reasonably well, they struggle with deep multi-hop reasoning and socially adept coordination, highlighting a significant gap in robust multi-agent planning and human–AI teaming.
τ-bench [81] | 2024 | Conversational Agent Evaluation | Evaluates dynamic, multi-turn conversations by comparing the final database state with an annotated goal state using a novel pass^k metric. | Integrates domain-specific API tool usage and strict policy adherence within simulated user interactions to assess agent reliability over multiple trials. | Reveals that even state-of-the-art agents (e.g., GPT-4o) succeed on less than 50% of tasks, with marked inconsistency (e.g., pass^8 < 25% in retail), highlighting the need for improved consistency and rule-following.
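The τ-bench row above reports reliability as pass^k (e.g., pass^8 < 25% in the retail domain). As we read the metric, it asks whether an agent succeeds on the same task in all of k independent trials, in contrast to the familiar pass@k, which requires only one success. Under that reading, and estimating from n recorded trials per task of which c succeed, a natural estimator, written by analogy with the standard pass@k estimator, is:

\[
\text{pass}^{k} \;=\; \mathbb{E}_{\text{tasks}}\!\left[\frac{\binom{c}{k}}{\binom{n}{k}}\right],
\qquad
\text{pass@}k \;=\; \mathbb{E}_{\text{tasks}}\!\left[1-\frac{\binom{n-c}{k}}{\binom{n}{k}}\right].
\]

These formulas are our hedged reconstruction for intuition only; the exact definition used by τ-bench is given in [81].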
granular assessment of their reasoning accuracy. The benchmark is employed to evaluate two classes of models: process reward models (PRMs) and critic models, the latter involving general large language models (LLMs) that are prompted to critique each solution step. Experimental results reveal two key findings. First, existing PRMs generally fail to generalize to more challenging math problems beyond standard datasets like GSM8K and MATH, often underperforming relative to both prompted LLM-based critics and a PRM fine-tuned on the larger, more complex PRM800K dataset. Second, the best open-source model tested, QwQ-32B-Preview, demonstrates error detection capabilities that rival those of the proprietary GPT-4o, although it still falls short compared to reasoning-specialized models like o1-mini.

G. OmniDocBench Benchmark

Ouyang et al. [63] introduced OmniDocBench, a comprehensive multi-source benchmark designed to advance automated document content extraction, a critical component for meeting the high-quality data needs of LLMs and RAG systems. OmniDocBench features a meticulously curated and annotated dataset spanning nine diverse document types, including academic papers, textbooks, slides, notes, and financial documents, and utilizes a detailed evaluation framework with 19 layout categories and 14 attribute labels to facilitate multi-level assessments. Through extensive comparative analysis of existing modular pipelines and multimodal end-to-end methods, the benchmark reveals that while specialized models (e.g., Nougat) outperform general vision-language models (VLMs) on standard documents, general VLMs exhibit superior resilience and adaptability in challenging scenarios, such as those involving fuzzy scans, watermarks, or colorful backgrounds. Moreover, fine-tuning general VLMs with domain-specific data leads to enhanced performance, as evidenced by high accuracy scores in tasks like formula recognition (with models such as GPT-4o, Mathpix, and UniMERNet achieving around 85–86.8% accuracy) and table recognition (RapidTable at 82.5%). Nonetheless, the findings also highlight persistent challenges, notably that complex column layouts continue to degrade reading order accuracy across all evaluated models.

H. Agent-as-a-Judge

The Meta team proposed the Agent-as-a-Judge framework [64], an innovative evaluation approach explicitly designed for agentic systems that overcomes the limitations of traditional methods, which either focus solely on outcomes or require extensive manual labor. This framework provides granular, intermediate feedback throughout the task-solving process by leveraging agentic systems to evaluate other agentic systems. The authors demonstrate its effectiveness on code generation tasks using DevAI, a new benchmark comprising 55 realistic automated AI development tasks annotated with 365 hierarchical user requirements. Their evaluation shows that Agent-as-a-Judge not only dramatically outperforms the conventional LLM-as-a-Judge approach (which typically achieves a 60–70% alignment rate with human assessment) but also reaches an impressive 90% alignment with human judgments. Additionally, this method offers substantial cost and time savings, reducing evaluation costs to approximately 2.29% ($30.58 vs. $1,297.50) and cutting evaluation time down to 118.43 minutes compared to 86.5 hours for human assessments.

I. JudgeBench Benchmark

Tan et al. [65] proposed JudgeBench, a novel benchmark designed to objectively evaluate LLM-based judges, models that are increasingly employed to assess and improve the outputs of large language models, by focusing on their ability to accurately discern factual and logical correctness rather than merely aligning with human stylistic preferences. Unlike prior benchmarks that rely primarily on crowdsourced human evaluations, JudgeBench leverages a carefully constructed set of 350 challenging response pairs spanning knowledge, reasoning, math, and coding domains. The benchmark employs a novel pipeline to transform challenging existing datasets into paired comparisons with preference labels based on objective correctness while mitigating positional bias through double evaluation with swapped order. Comprehensive testing across various judge architectures, including prompted, fine-tuned, and multi-agent judges as well as reward models, reveals that even strong models, such as GPT-4o, often perform only marginally better than random guessing, particularly on tasks requiring rigorous error detection in intermediate reasoning steps. Moreover, fine-tuning can significantly boost performance, as evidenced by a 14% improvement observed in Llama 3.1 8B, and reward models achieve accuracies in the 59–64% range.

J. SimpleQA Benchmark

SimpleQA [66] is a benchmark introduced by OpenAI to assess and improve the factual accuracy of large language models on short, fact-seeking questions. Comprising 4,326 questions spanning domains such as science/tech, politics, art, and geography, SimpleQA challenges models to deliver a single correct answer under a strict three-tier grading system ("correct," "incorrect," or "not attempted"). While built on foundational datasets such as TriviaQA and Natural Questions, SimpleQA presents a more challenging task for LLMs. Early results indicate that even advanced models, such as OpenAI o1-preview, achieve only 42.7% accuracy (with Claude 3.5 Sonnet trailing at 28.9%), and models tend to exhibit overconfidence in their incorrect responses. Moreover, experiments that repeated the same question 100 times revealed a strong correlation between higher answer frequency and overall accuracy. This benchmark thus provides critical insights into the current limitations of LLMs in handling straightforward, factual queries. It underscores the need for further improvements in grounding model outputs in reliable, factual data.

TABLE IV: LLM Benchmark Comparison: Multimodal, Task Diversity, Reasoning & Agentic AI Evaluation

K. FineTasks

FineTasks [67] is a data-driven evaluation framework designed to systematically select reliable tasks for assessing LLMs across diverse languages. Developed as the first step toward the broader FineWeb Multilingual initiative, FineTasks evaluates candidate tasks based on four critical metrics:
monotonicity, low noise, non-random performance, and model ordering consistency, to ensure robustness and reliability. In an extensive study, the Hugging Face team tested 185 candidate tasks across nine languages (including Chinese, French, Arabic, Russian, Thai, Hindi, Turkish, Swahili, and Telugu), ultimately selecting 96 final tasks that cover domains such as reading comprehension, general knowledge, language understanding, and reasoning. The work further reveals that the formulation of tasks has a significant impact on performance; for instance, Cloze format tasks are more effective during early training phases, while multiple-choice formats yield better evaluation results. Recommended evaluation metrics include length normalization for most tasks and pointwise mutual information (PMI) for complex reasoning challenges. Benchmarking 35 open and closed-source LLMs demonstrated that open models are narrowing the gap with their proprietary counterparts, with Qwen 2 models excelling in high- and mid-resource languages and Gemma-2 particularly strong in low-resource settings. Moreover, the FineTasks framework supports over 550 tasks across various languages, providing a scalable and comprehensive platform for advancing multilingual large language model (LLM) evaluation.

L. FRAMES Benchmark

The Google team [68] proposes FRAMES (Factuality, Retrieval, and Reasoning MEasurement Set), a comprehensive evaluation dataset specifically designed to assess the capabilities of retrieval-augmented generation (RAG) systems built on LLMs. FRAMES addresses a critical need by unifying evaluations of factual accuracy, retrieval effectiveness, and reasoning ability in an end-to-end framework, rather than assessing these facets in isolation. The dataset comprises 824 challenging multi-hop questions spanning diverse topics, including history, sports, science, and health, each requiring the integration of information from between two and fifteen Wikipedia articles. By labeling questions with specific reasoning types, such as numerical or tabular, FRAMES provides a nuanced benchmark to identify the strengths and weaknesses of current RAG implementations. Baseline experiments reveal that state-of-the-art models like Gemini-Pro-1.5-0514 achieve only 40% accuracy when operating without retrieval mechanisms, but their performance increases significantly to 66% with a multi-step retrieval pipeline, representing a greater than 50% improvement.

M. DABStep Benchmark

DABStep [69] is a new framework from Hugging Face that pioneers a step-based approach to enhance the performance and efficiency of language models on multi-step reasoning tasks. DABStep addresses the challenges of traditional end-to-end inference by decomposing complex problem-solving into discrete, manageable steps, enabling models to refine their outputs through step-level feedback and iterative dynamic adjustments. This method is designed to enable models to self-correct and navigate the complexities of multi-step reasoning processes more effectively. However, despite these innovative improvements, experimental results reveal that even the best-performing model under this framework only achieves a 16% success rate on the evaluated tasks. This modest accuracy underscores the significant challenges that remain in effectively training models for complex, iterative reasoning and highlights the need for further research and optimization.

N. BFCL v2 Benchmark

Mao et al. [70] propose BFCL v2, a novel benchmark and leaderboard designed to evaluate large language models' function calling abilities using real-world, user-contributed data. The benchmark comprises 2,251 question-function-answer pairs, enabling comprehensive assessments across a range of scenarios from multiple and straightforward function calls to parallel executions and irrelevance detection. By leveraging authentic user interactions, BFCL v2 addresses prevalent issues such as data contamination, bias, and limited generalization in previous evaluation methods. Initial evaluations reveal that models like Claude 3.5 and GPT-4 consistently outperform others, with Mistral, Llama 3.1 FT, and Gemini following in performance. However, some open models, such as Hermes, struggle due to potential prompting and formatting challenges. Overall, BFCL v2 offers a rigorous and diverse platform for benchmarking the practical capabilities of LLMs in interfacing with external tools and APIs, thereby providing valuable insights for future advancements in function calling and interactive AI systems.

O. SWE-Lancer Benchmark

The OpenAI team [71] presents SWE-Lancer, an innovative benchmark comprising over 1,400 freelance software engineering tasks collected from Upwork, representing more than $1 million in real-world payouts. This benchmark encompasses both independent engineering tasks, ranging from minor bug fixes to substantial feature implementations valued up to $32,000, and managerial tasks, where models must select the best technical proposals. Independent tasks are rigorously evaluated using end-to-end tests that have been triple-verified by experienced engineers. At the same time, managerial decisions are benchmarked against the selections made by the original hiring managers. Experimental results indicate that state-of-the-art models, such as Claude 3.5 Sonnet, still struggle with the majority of these tasks, achieving a 26.2% pass rate on independent tasks and 44.9% on managerial tasks, which translates to an estimated earning of $403K, a figure well below the total available value. Notably, the analysis highlights that while models tend to perform better in evaluative managerial roles than in direct code implementation, increasing inference-time computing can enhance performance.

P. Comprehensive RAG Benchmark (CRAG)

Yang et al. [72] propose the Comprehensive RAG Benchmark (CRAG), a novel dataset designed to rigorously evaluate the factual question-answering capabilities of Retrieval-Augmented Generation systems. CRAG comprises 4,409 question-answer pairs across five domains and eight distinct question
topologies, and finds that direct peer-to-peer communication and cognitive planning are particularly effective, evidenced by a 3% improvement in milestone achievement when planning is employed, while also noting that adding more agents can decrease performance. Among the models evaluated (GPT-4o-mini, GPT-3.5, and Llama), GPT-4o-mini achieved the highest average task score, and graph-based coordination protocols outperformed other structures in research scenarios.

performance bottlenecks in current large language models (LLMs), which, while adept at factual retrieval and short-range planning, struggle with deep multi-hop reasoning, spatial inference, and socially coordinated decision-making. For instance, models perform reasonably well on simple tasks like Tic-Tac-Toe but falter in complex environments such as Chess or Diplomacy, and even the best models achieve only around 58.59% accuracy on classical planning tasks.
[Fig. 2: Taxonomy of approximately 60 LLM and agentic AI benchmarks, grouped into Academic & General Knowledge Reasoning, Mathematical Problem Solving, Code & Software Engineering, Factual Grounding & Retrieval, Domain-Specific Evaluations, Multimodal, Visual & Embodied Evaluations, Task Selection, and Agentic & Interactive Evaluations.]
FRAMES [68], CRAG [72], DIA [74], CyberMetric [75], TeamCraft [95], AgentHarm [96], τ-bench [81], LegalAgentBench [97], and GPQA [98].

Recent benchmarks from 2025 further indicate a substantial expansion in the depth and breadth of large language model (LLM) evaluations. ENIGMAEVAL [57] and ComplexFuncBench [59] target complex puzzles and function calling tasks, while MedAgentsBench [99] and Humanity's Last Exam [60] focus on advanced medical reasoning and expert-level academic tasks. Additional benchmarks such as DABStep [69], BFCL v2 [70], SWE-Lancer [71], and OCCULT [73] further diversify evaluative criteria by incorporating multi-step reasoning, cybersecurity, and freelance software engineering challenges. The table also includes BIG-Bench Extra Hard [76], MultiAgentBench [77], CASTLE [79], EmbodiedEval [100], SPIN-Bench [80], OlympicArena [101], SciReplicate-Bench [102], EconAgentBench [103], VeriLA [104], CapaBench [105], AgentOrca [106], ProjectEval [107], RefactorBench [108], BEARCUBS [109], Robotouille [110], DSGBench [111], TheoremExplainBench [112], RefuteBench 2.0 [113], MLGym [114], DataSciBench [115], EmbodiedBench [116], BrowseComp [117], and MLE-bench [119]. Collectively, these benchmarks exemplify the field's shift towards more comprehensive and nuanced evaluation metrics, supporting the development of LLMs that can tackle increasingly multifaceted, real-world challenges.

Fig. 2 groups benchmarks into categories such as Academic & General Knowledge Reasoning, Mathematical Problem Solving, Code & Software Engineering, Factual Grounding & Retrieval, Domain-Specific Evaluations, Multimodal/Visual & Embodied Evaluations, Task Selection, and Agentic & Interactive Evaluations, illustrating the full range of tasks used to assess LLMs in AI agent settings.

IV. AI AGENTS

This section presents a comprehensive overview of AI agent frameworks and applications developed between 2024 and 2025, highlighting transformative approaches that integrate
Framework | Purpose | Mechanism | Highlights
LangChain [124] | Integrates LLMs with diverse tools to build autonomous agents. | Combines conversational LLMs, search integrations, and utility functions into iterative workflows. | Customizable roles and streamlined agent prototyping.
LlamaIndex [125] | Enables autonomous agent creation via external tool integration. | Wraps functions into FunctionTool objects and employs a ReActAgent for stepwise tool selection. | Simplifies agent development with a dynamic, modular pipeline.
CrewAI [126] | Orchestrates teams of specialized AI agents for complex tasks. | Structures systems into Crew (oversight), AI Agents (specialized roles), Process (collaboration), and Tasks (assignments). | Mimics human team collaboration with flexible, parallel workflows.
Swarm [127] | Provides a lightweight, stateless abstraction for multi-agent systems. | Defines multiple agents with specific instructions and roles; enables dynamic handoffs and context management. | Fine-grained control and compatibility with various backends.
GUI Agent [128] | Facilitates computer control via natural language and visual inputs. | Translates user instructions and screenshots into desktop actions (e.g., cursor movements, clicks). | Demonstrates end-to-end performance in real-world desktop workflows.
Agentic Reasoning [129] | Enhances reasoning by integrating specialized external tool-using agents. | Leverages web-search, coding, and Mind Map agents to iteratively refine multi-step reasoning. | Achieves improved multi-step problem-solving and structured knowledge synthesis.
OctoTools [130] | Empowers LLMs for complex reasoning via training-free tool integration. | Combines standardized tool cards, a strategic planner, and an executor for effective tool usage. | Outperforms similar frameworks by up to 10.6% on varied tasks.
Agents SDK [131] | Provides a modular framework for building autonomous agent applications that integrate LLMs with external tools and advanced features. | Offers core primitives such as Agents (LLMs with instructions, tools, handoffs, and guardrails), Tools (wrapped functions/APIs), and Context for state management, along with support for Streaming, Tracing, and Guardrails to manage multi-turn interactions. | Streamlines development with an extensible, robust architecture that enhances debuggability and scalability, enabling rapid prototyping and seamless integration of complex, multi-agent workflows.
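To make the LlamaIndex row above concrete, the sketch below shows one common way the FunctionTool/ReActAgent pipeline is wired together. It is a minimal illustration only, assuming a recent llama_index release with the OpenAI integration installed and an API key in the environment; exact module paths vary across versions, and the multiply tool is an invented example.

```python
# Hedged sketch of the LlamaIndex pattern: wrap a plain Python function in a
# FunctionTool and let a ReActAgent select and call it step by step.
from llama_index.core.tools import FunctionTool
from llama_index.core.agent import ReActAgent
from llama_index.llms.openai import OpenAI  # assumes llama-index-llms-openai is installed

def multiply(a: float, b: float) -> float:
    """Multiply two numbers."""  # the docstring doubles as the tool description
    return a * b

multiply_tool = FunctionTool.from_defaults(fn=multiply)
agent = ReActAgent.from_tools([multiply_tool], llm=OpenAI(model="gpt-4o-mini"), verbose=True)
print(agent.chat("What is 20.5 multiplied by 4?"))  # the agent reasons, then invokes the tool
```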
TABLE VI: Comparative Analysis of LLM Strategies in RAG, AI Agents, and Agentic RAG

Feature | LLM Pre-trained | LLM Post Training & Fine Tuning | RAG | AI Agents | Agentic RAG
Core Function | Uses LLM for text generation. | Applies task-specific tuning. | Retrieves data and generates text. | Automates tasks and decisions. | Integrates retrieval with adaptive reasoning.
Autonomy | Basic language understanding. | Enhances autonomy through tuning. | Limited; user-driven. | Moderately autonomous. | Highly autonomous.
Learning | Relies on pre-training. | Uses fine tuning for precision. | Static pre-trained knowledge. | Incorporates user feedback. | Adapts using real-time data.
Use Cases | General applications. | Domain-specific enhancements. | Q&A, summaries, guidance. | Chatbots, automation, workflow. | Complex decision-making tasks.
Complexity | Provides baseline complexity. | Adds refined capabilities. | Simple integration. | More sophisticated. | Highly complex.
Reliability | Depends on static training data. | Improves consistency with updates. | Consistent for known queries. | May vary with dynamic inputs. | Reliability boosted by adaptive methods.
Scalability | Scales with model size. | Scales with domain-specific tuning. | Easily scalable for static tasks. | Scales moderately with added features. | Scalable for complex systems (with extra resources).
Integration | Easily integrable with various apps. | Requires domain customization. | Integrates well with retrieval systems. | Connects with operational workflows. | Supports advanced decision frameworks.
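To make the last column of Table VI more tangible, the following framework-agnostic sketch shows one common reading of the agentic RAG pattern: retrieval, generation, and reflection run inside a loop, and the agent decides whether to retrieve again. All callables (retrieve, generate, critique) are illustrative stubs supplied by the caller, not part of any specific library.

```python
# Minimal sketch of an agentic RAG loop: retrieve -> draft -> reflect -> re-retrieve.
from typing import Callable, List

def agentic_rag(question: str,
                retrieve: Callable[[str], List[str]],
                generate: Callable[[str, List[str]], str],
                critique: Callable[[str, str], str],
                max_rounds: int = 3) -> str:
    """Iteratively retrieve, draft, and self-critique until the draft looks grounded."""
    query, evidence, draft = question, [], ""
    for _ in range(max_rounds):
        evidence += retrieve(query)            # plan/act: fetch fresh context for the query
        draft = generate(question, evidence)   # act: draft an answer from the evidence
        feedback = critique(question, draft)   # reflect: "OK" or a refined retrieval query
        if feedback == "OK":
            break
        query = feedback                       # adapt the next retrieval round
    return draft
```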
efficient execution; and Tasks, which are individual assignments with clear objectives that contribute to a larger goal. Key features of CrewAI include role-based agent specialization, flexible integration of custom tools and APIs, intelligent collaboration that mimics natural human interaction, and robust task management supporting both sequential and parallel workflows. Together, these elements enable the creation of dynamic, production-ready AI teams capable of achieving sophisticated, multi-step objectives in real-world applications.
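A minimal sketch of the Crew/Agent/Task/Process structure described above follows. It assumes a recent crewai release with an LLM configured via environment variables; the role names, goals, and task descriptions are invented for illustration and are not taken from the CrewAI documentation.

```python
from crewai import Agent, Task, Crew, Process  # assumes `pip install crewai`

researcher = Agent(
    role="Research Analyst",
    goal="Collect key facts about a topic",
    backstory="A careful analyst who cites sources.",
)
writer = Agent(
    role="Technical Writer",
    goal="Turn research notes into a short summary",
    backstory="A concise writer for engineering audiences.",
)

research_task = Task(
    description="List three recent LLM agent benchmarks with one-line descriptions.",
    expected_output="A bulleted list of three benchmarks.",
    agent=researcher,
)
writing_task = Task(
    description="Write a 100-word summary based on the research notes.",
    expected_output="A single 100-word paragraph.",
    agent=writer,
)

# Sequential process: tasks run in order and pass their outputs downstream.
crew = Crew(agents=[researcher, writer],
            tasks=[research_task, writing_task],
            process=Process.sequential)
print(crew.kickoff())
```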
4) Swarm: Swarm [127] is a lightweight, experimental library from OpenAI designed to build and manage multi-agent systems without relying on the Assistants API. Swarm provides a stateless abstraction that orchestrates a continuous loop of agent interactions, function calls, and dynamic handoffs, offering fine-grained control and transparency. Key features include the following (a minimal usage sketch follows this list):
• Agent Definition: Developers can define multiple agents, each equipped with its own set of instructions, designated role (e.g., "Sales Agent"), and available functions, which are converted into standardized JSON structures.
• Dynamic Handoffs: Agents can transfer control to one another based on the conversation flow or specific function criteria, simply by returning the next agent to call.
• Context Management: Context variables are used to initialize and update state throughout the conversation, ensuring continuity and effective information sharing across agents.
• Client Orchestration: The client.run() function initiates and manages the multi-agent dialogue by taking an initial agent, user messages, and context, and then returning updated messages, context variables, and the last active agent.
• Direct Function Calling & Streaming: Swarm supports direct Python function calls within agents and provides streaming responses for real-time interactions.
• Flexibility: The framework is designed to be agnostic to the underlying OpenAI client, working seamlessly with tools such as Hugging Face TGI or vLLM hosted models.
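The sketch below shows agent definition, a function-based handoff, and client orchestration in the style of the openai/swarm library; exact APIs may change between releases of this experimental package.

# Minimal Swarm-style example: a triage agent hands off sales questions to a sales agent.
from swarm import Swarm, Agent

def transfer_to_sales():
    """Returning another Agent from a function triggers a dynamic handoff."""
    return sales_agent

def quote_price(item: str) -> str:
    # Plain Python function exposed to the agent as a callable tool.
    return f"The listed price for {item} is $42."

triage_agent = Agent(
    name="Triage Agent",
    instructions="Route sales questions to the Sales Agent; answer the rest yourself.",
    functions=[transfer_to_sales],
)
sales_agent = Agent(
    name="Sales Agent",
    instructions="Answer pricing questions using the quote_price function.",
    functions=[quote_price],
)

client = Swarm()
response = client.run(
    agent=triage_agent,
    messages=[{"role": "user", "content": "How much does the starter kit cost?"}],
    context_variables={"customer_tier": "standard"},  # shared state across handoffs
)
print(response.messages[-1]["content"])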
5) GUI Agent: Hu et al. [128] introduced Claude 3.5 tool functionalities, a planner for orchestrating both high-level
Computer Use, marking a significant milestone as the first and low-level strategies, and an executor for effective tool us-
frontier AI model to offer computer control via a graphical age, OctoTools overcomes the limitations of prior methods that
user interface in a public beta setting. The study assembles a were confined to specialized domains or required extra training
diverse set of tasks, ranging from web search and productivity data. Validated across 16 varied tasks including MathVista,
workflows to gaming and file management, to rigorously MMLU-Pro, MedQA, and GAIA-Text OctoTools achieves an
evaluate the model’s ability to translate natural language average accuracy improvement of 9.3% over GPT-4o and
instructions and screenshots into precise desktop actions, such outperforms frameworks like AutoGen, GPT-Functions, and
as cursor movements, clicks, and keystrokes. The evaluation LangChain by up to 10.6% when using the same toolset.
framework not only demonstrates Claude 3.5’s unprecedented Comprehensive analysis and ablation studies demonstrate its
end-to-end performance, with a success rate of 16 out of advantages in task planning, effective tool integration, and
20 test cases, but also highlights critical areas for future multi-step problem solving, positioning it as a significant
refinement, including improved planning, action execution, advancement for general-purpose, complex reasoning appli-
and self-critique capabilities. Moreover, the performance is cations.
shown to be influenced by factors like screen resolution, and 7) Agents SDK: The OpenAI Agents SDK [131] provides a
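The sketch below illustrates the reverse-synthesis idea in OS-Genesis: explore first, derive the task afterwards, and keep only trajectories that a reward model scores highly. All function and class names are hypothetical stand-ins rather than the authors' released code.

# Hypothetical sketch of reverse trajectory synthesis with a trajectory reward filter.
import random
from dataclasses import dataclass

@dataclass
class Step:
    screenshot: bytes
    action: str          # e.g. "click(settings_icon)" or "type('wifi')"

def explore_environment(env, n_steps: int = 20) -> list:
    """Interact step by step without any pre-defined task."""
    steps = []
    for _ in range(n_steps):
        action = random.choice(env.available_actions())
        observation = env.execute(action)
        steps.append(Step(screenshot=observation, action=action))
    return steps

def derive_task(trajectory, vlm) -> str:
    """Ask a vision-language model what high-level task this trajectory accomplishes."""
    return vlm.describe_goal([s.screenshot for s in trajectory],
                             [s.action for s in trajectory])

def synthesize(env, vlm, reward_model, threshold: float = 0.7):
    trajectory = explore_environment(env)
    task = derive_task(trajectory, vlm)
    # Score instruction/trajectory coherence and completeness; discard low-quality samples.
    if reward_model.score(task, trajectory) >= threshold:
        return {"instruction": task, "trajectory": trajectory}
    return None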
6) Agentic Reasoning: Wu et al. [129] present a novel framework that significantly enhances the reasoning capabilities of large language models by integrating external tool-using agents into the inference process. The approach leverages three key agents: a web-search agent for real-time retrieval of pertinent information, a coding agent for executing computational tasks, and a Mind Map agent that constructs structured knowledge graphs to track and organize logical relationships during reasoning. By dynamically engaging these specialized agents, the framework enables LLMs to perform multi-step, expert-level problem solving and deep research, addressing limitations in conventional internal reasoning approaches. Evaluations on challenging benchmarks such as the GPQA dataset and domain-specific deep research tasks demonstrate that Agentic Reasoning substantially outperforms traditional retrieval-augmented generation systems and closed-source models, highlighting its potential for improved knowledge synthesis, test-time scalability, and structured problem-solving.
OctoTools [130] is a robust, training-free, and user-friendly framework designed to empower large language models to tackle complex reasoning tasks across diverse domains. By integrating standardized tool cards that encapsulate various tool functionalities, a planner for orchestrating both high-level and low-level strategies, and an executor for effective tool usage, OctoTools overcomes the limitations of prior methods that were confined to specialized domains or required extra training data. Validated across 16 varied tasks, including MathVista, MMLU-Pro, MedQA, and GAIA-Text, OctoTools achieves an average accuracy improvement of 9.3% over GPT-4o and outperforms frameworks like AutoGen, GPT-Functions, and LangChain by up to 10.6% when using the same toolset. Comprehensive analysis and ablation studies demonstrate its advantages in task planning, effective tool integration, and multi-step problem solving, positioning it as a significant advancement for general-purpose, complex reasoning applications.
7) Agents SDK: The OpenAI Agents SDK [131] provides a comprehensive framework for building autonomous, multi-step agent applications that harness the power of large language models alongside external tools. This SDK abstracts the core components necessary for agentic workflows, including agents themselves, which are LLMs configured with instructions, tools, handoffs, and guardrails, as well as the tools that enable these agents to perform external actions (such as API calls or computations). It also supports context management to maintain state over multi-turn interactions, structured output types for reliable data exchange, and advanced features like streaming, tracing, and guardrails to ensure safety and debuggability.
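A minimal sketch of this workflow is shown below; module and class names follow the public openai-agents documentation at the time of writing and may differ in later releases, and the example tool and agents are illustrative.

# Agents SDK-style example: a tool-using agent reached through a handoff.
from agents import Agent, Runner, function_tool

@function_tool
def get_weather(city: str) -> str:
    """External action exposed to the agent as a tool."""
    return f"The weather in {city} is sunny."

weather_agent = Agent(
    name="Weather Agent",
    instructions="Answer weather questions using the get_weather tool.",
    tools=[get_weather],
)
triage_agent = Agent(
    name="Triage Agent",
    instructions="Hand off weather questions; answer everything else directly.",
    handoffs=[weather_agent],
)

result = Runner.run_sync(triage_agent, "What is the weather in Ottawa?")
print(result.final_output)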
B. AI Agent applications
AI Agents are autonomous systems that combine large language models (LLMs), data retrieval mechanisms, and decision-making pipelines to tackle a wide array of tasks across industries. In healthcare, they assist with clinical diagnosis and personalized treatment planning; in finance, they support forecasting and risk analysis; in scientific research, they automate literature review and experimental design; and in software engineering, they generate, analyze, and repair code. Using domain-specific fine-tuning and structured data sources, AI agents can also drive the generation of synthetic data, facilitate chemical reasoning, support mathematical problem-solving, and enable creative multimedia production, thereby expanding the reach of AI-powered automation and insight generation. Fig. 7 presents both the architectural backbone and the application landscape of AI Agents.
1) Healthcare Applications: The healthcare sector has witnessed significant advancements through the integration of large language model-based agents across a wide range of applications. In this subsection, we present recent developments organized into key categories, as presented in Fig. 8, including clinical diagnosis and decision support, mental health and therapy agents, general medical assistants for workflow optimization, and pharmaceutical and drug discovery agents. These works demonstrate how AI agents are increasingly supporting medical professionals, enhancing diagnostic accuracy, improving patient care, and accelerating research in diverse healthcare domains. The accompanying table reviews AI agent applications for healthcare.
a) Clinical Diagnosis, Imaging & Decision Support: Chen et al. [146] introduce Chain-of-Diagnosis (CoD), a novel approach designed to enhance the interpretability of LLM-based medical diagnostics. By transforming the diagnostic process into a transparent, step-by-step chain that mirrors a physician's reasoning, CoD provides a clear reasoning pathway alongside a disease confidence distribution, which aids in identifying critical symptoms through entropy reduction. This transparent methodology not only makes the diagnostic process controllable but also boosts rigor in decision-making. Leveraging CoD, the authors developed DiagnosisGPT, an advanced system capable of diagnosing 9,604 diseases. Experimental results demonstrate that DiagnosisGPT outperforms existing large language models (LLMs) on diagnostic benchmarks, achieving both high diagnostic accuracy and enhanced interpretability.
Zhou et al. [147] present ZODIAC, an innovative LLM-powered framework that elevates cardiological diagnostics to a level of professionalism comparable to that of expert cardiologists. Designed to address the limitations of general-purpose large language models (LLMs) in clinical settings, ZODIAC leverages a multi-agent collaboration architecture to process patient data across multiple modalities. Each agent is fine-tuned using real-world patient data adjudicated by cardiologists, ensuring the system's diagnostic outputs, such as the extraction of clinically relevant characteristics, arrhythmia detection, and preliminary report generation, are accurate and reliable. Rigorous clinical validation, conducted by independent cardiologists and evaluated across eight metrics addressing clinical effectiveness and security, demonstrates that ZODIAC outperforms industry-leading models, including GPT-4o, Llama-3.1-405B, Gemini-pro, and even specialized medical LLMs like BioGPT. Notably, the successful integration of ZODIAC into electrocardiography (ECG) devices underscores its potential to transform healthcare delivery, exemplifying the emerging trend of embedding LLMs within Software-as-Medical-Device (SaMD) solutions.
Wang et al. [148] introduce MedAgent-Pro, an evidence-based, agentic system designed to enhance multi-modal medical diagnosis by addressing key limitations of current Multi-modal Large Language Models (MLLMs). While MLLMs have demonstrated strong reasoning and task-performing capabilities, they often struggle with detailed visual perception and exhibit reasoning inconsistencies, both of which are critical in clinical settings. MedAgent-Pro employs a hierarchical workflow: at the task level, it leverages knowledge-based reasoning to generate reliable diagnostic plans grounded in retrieved clinical criteria, and at the case level, it utilizes multiple tool agents to process multi-modal inputs and analyze diverse indicators. The final diagnosis is derived from a synthesis of quantitative and qualitative evidence. Comprehensive experiments on both 2D and 3D medical diagnosis tasks demonstrate that MedAgent-Pro not only outperforms existing methods but also offers enhanced reliability and interpretability, marking a significant step forward in AI-assisted clinical diagnostics.
Feng et al. [150] introduce M3Builder. This novel multi-agent system automates machine learning workflows in the medical imaging domain, a field that has traditionally needed specialized models and tools. M3Builder is structured around four specialized agents that collaboratively manage complex, multi-step ML tasks, including automated data processing, environment configuration, self-contained auto-debugging, and model training, all within a dedicated medical imaging ML workspace. To assess progress in this area, the authors propose M3Bench, a comprehensive benchmark featuring four general tasks across 14 training datasets, covering five anatomies, three imaging modalities, and both 2D and 3D data. Evaluations using seven state-of-the-art large language models as agent cores, such as the Claude series, GPT-4o, and DeepSeek-V3, demonstrate that M3Builder significantly outperforms existing ML agent designs, achieving a remarkable 94.29% success rate with Claude-3.7-Sonnet.
Rose et al. [151] tackle the complexities of differential diagnosis (DDx) by introducing the Modular Explainable DDx Agent (MEDDxAgent) framework, which facilitates interactive, iterative diagnostic reasoning rather than relying on complete patient profiles from the outset.
System | Year | Domain | Objective | Approach | Key results
DiagnosisGPT [146] | 2024 | Medical Diagnostics | Enhance interpretability via a transparent, step-by-step chain. | Implements CoD to yield confidence scores and entropy reduction. | Diagnoses 9,604 diseases; outperforms existing LLMs.
ZODIAC [147] | 2024 | Cardiology | Deliver expert-level cardiological diagnostics. | Multi-agent LLM fine-tuned on adjudicated patient data. | Outperforms leading models; integrated into ECG devices.
MedAgent-Pro [148] | 2025 | Medical Diagnosis | Enhance multi-modal diagnosis by addressing visual and reasoning gaps. | Hierarchical workflow with knowledge-based reasoning and multi-modal agents. | Outperforms existing methods on 2D/3D tasks with improved reliability.
Steenstra et al. [149] | 2025 | Therapeutic Counseling | Improve counseling training with continuous, detailed feedback. | LLM-powered simulated patients with turn-by-turn visualizations. | High usability and satisfaction; enhances learning vs. traditional methods.
M3Builder [150] | 2025 | Medical Imaging ML | Automate ML workflows in medical imaging. | Four agents manage data processing, configuration, debugging, and training. | Achieves 94.29% success with state-of-the-art LLM cores.
MEDDxAgent [151] | 2025 | Differential Diagnosis | Enable iterative, interactive differential diagnosis. | Integrates a DDxDriver, history simulator, and specialized retrieval/diagnosis agents. | Boosts diagnostic accuracy by over 10% with enhanced explainability.
PathFinder [152] | 2025 | AI-assisted Diagnostics | Replicate holistic WSI analysis as done by expert pathologists. | Four agents collaboratively generate importance maps and diagnoses. | Outperforms state-of-the-art by 8%, exceeding average pathologist performance by 9%.
HamRaz [153] | 2025 | Therapeutic Counseling | Provide the first Persian PCT dataset for LLMs with culturally adapted therapy sessions. | Combines scripted dialogues and adaptive LLM role-play. | Produces more empathetic, nuanced, and realistic counseling interactions.
CAMI [154] | 2025 | Therapeutic Counseling | Automate MI-based counseling with client state inference, topic exploration, and empathetic response generation. | STAR framework with three LLM modules for state, topic, and response. | Outperforms baselines in MI competency and counseling realism.
AutoCBT [155] | 2025 | Therapeutic Counseling | Deliver dynamic CBT via multi-agent routing and supervision. | Uses single-turn agents and dynamic supervisory routing for tailored interventions. | Generates higher-quality CBT responses vs. fixed systems.
PSYCHE [156] | 2025 | Psychiatric Assessment | Benchmark PACAs with simulated patient profiles and multi-turn interactions. | Uses detailed psychiatric constructs and board-certified psychiatrist evaluations. | Validated for clinical appropriateness and safety.
PsyDraw [157] | 2024 | Mental Health Screening | Analyze HTP drawings with multimodal agents for early screening of LBCs. | Two-stage feature extraction and report generation; evaluated on 290 submissions; pilot deployment in schools. | 71.03% high consistency with experts; scalable screening tool.
EvoPatient [158] | 2024 | Medical Training | Simulate patient–doctor dialogues for training via unsupervised LLM agents. | Iterative multi-turn consultations refine patient responses and physician questions over 200 case simulations. | Improves requirement alignment by >10% and achieves higher human preference.
Scripted Therapy Agents [159] | 2024 | Therapeutic Counseling | Constrain LLM responses via expert-written scripts and finite conversational states. | Two prompting variants execute 100 simulated sessions following deterministic therapeutic scripts. | Demonstrates reliable script adherence and transparent decision paths.
LIDDiA [160] | 2025 | Drug Discovery | Automate end-to-end drug discovery from target selection to lead optimization. | Orchestrates LLM-driven reasoning across all pipeline steps; evaluated on 30 targets. | Generates valid candidates in >70% of cases; identifies novel EGFR inhibitors.
PatentAgent [161] | 2024 | Pharmaceutical Patents | Streamline patent analysis with LLM-driven QA, image-to-molecule, and scaffold ID. | PA-QA, PA-Img2Mol, PA-CoreId modules for comprehensive patent insights. | Improves image-to-molecule accuracy by up to 8.37% and scaffold ID by up to 7.62%.
DrugAgent [162] | 2024 | Drug Repurposing | Accelerate drug repurposing via multi-agent ML and knowledge integration. | Combines DTI modeling, KG extraction, and literature mining agents. | Improves prediction accuracy and reduces discovery time/cost.
MAP [163] | 2025 | Inpatient Decision Support | Support complex inpatient pathways with specialized triage, diagnosis, and treatment agents. | Uses IPDS benchmark; coordinated by a chief agent for end-to-end care planning. | +25.10% diagnostic accuracy vs. HuatuoGPT2-13B; +10–12% clinical compliance over clinicians.
SynthUserEval [164] | 2025 | Health Coaching | Generate synthetic users for evaluating behavior-change agents. | Creates structured profiles and simulates interactions with coaching agents. | Enables realistic, health-grounded dialogues; validated by expert evaluations.
C: Clinical Validation; W: Workflow Integration; R: Regulatory Compliance.
Fig. 7: AI Agent architecture (LLM model, database, action) and application landscape (healthcare, research applications, materials science, biomedical science, software engineering).
Addressing limitations in previous approaches, such as evaluations on single datasets, isolated component optimization, and single-attempt diagnoses, MEDDxAgent integrates three modular components: an orchestrator (DDxDriver), a history-taking simulator, and two specialized agents for knowledge retrieval and diagnosis strategy. To ensure robust evaluation, the authors also present a comprehensive DDx benchmark covering respiratory, skin, and rare diseases. Their findings reveal that iterative refinement significantly enhances diagnostic accuracy, with MEDDxAgent achieving over a 10% improvement across both large and small LLMs while providing critical explainability in its reasoning process.
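The control flow of such an iterative differential-diagnosis loop can be sketched as follows; the component interfaces are hypothetical stand-ins for illustration, not the released MEDDxAgent code.

# Sketch of an orchestrated history-taking / retrieval / diagnosis refinement loop.
def run_ddx(orchestrator, history_simulator, retriever, diagnoser, max_rounds=5):
    findings = []                      # accumulated symptoms, exam results, labs
    candidates = []                    # ranked differential diagnosis
    for _ in range(max_rounds):
        question = orchestrator.next_question(findings, candidates)
        findings.append(history_simulator.answer(question))
        evidence = retriever.lookup(findings)          # guideline / literature snippets
        candidates = diagnoser.rank(findings, evidence)
        if orchestrator.confident(candidates):         # stop early once a clear leader emerges
            break
    return candidates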
Ghezloo et al. [152] introduce PathFinder, a novel multi-modal, multi-agent framework designed to replicate the holistic diagnostic process of expert pathologists when analyzing whole-slide images (WSIs). Recognizing that WSIs are characterized by their gigapixel scale and complex structure, PathFinder employs four specialized agents, a Triage Agent, a Navigation Agent, a Description Agent, and a Diagnosis Agent, that collaboratively navigate and interpret the image data. The Triage Agent first determines whether a slide is benign or risky; if deemed risky, the Navigation and Description Agents iteratively focus on and characterize significant regions, generating importance maps and detailed natural language descriptions. Finally, the Diagnosis Agent synthesizes these findings to provide a comprehensive diagnostic classification that is inherently explainable. Experimental results indicate that PathFinder outperforms state-of-the-art methods in skin melanoma diagnosis by 8% and, notably, surpasses the average performance of pathologists by 9%, establishing a new benchmark for accurate, efficient, and interpretable AI-assisted diagnostics in pathology.
b) Mental Health, Counseling & Therapy Agents: Wasenmüller et al. [159] present a script-based dialog policy planning paradigm that enables LLM-powered conversational agents to function as AI therapists by adhering to expert-written therapeutic scripts and transitioning through a finite set of conversational states. By treating the script as a deterministic guide, the approach constrains the model's responses to align with a defined therapeutic framework, making decision paths transparent for clinical evaluation and risk management. The authors implement two variants of this paradigm, utilizing different prompting strategies, and generate 100 simulated therapy sessions with LLM-driven patient agents. Experimental results demonstrate that both implementations can reliably follow the scripted policy, providing insights into their relative efficiency and effectiveness, and underscoring the feasibility of building inspectable, rule-aligned AI therapy systems.
Du et al. [158] introduce EvoPatient, a framework for generating simulated patients using large language models to train medical personnel through multi-turn diagnostic dialogues. Existing approaches focus on data retrieval accuracy or prompt tuning, but EvoPatient emphasizes unsupervised simulation to teach patient agents standardized presentation patterns.
In this system, a patient agent and doctor agents engage in iterative consultations, with each dialogue cycle serving to both train the agents and gather experience that refines patient responses and physician questions. Extensive experiments across diverse clinical scenarios show that EvoPatient improves requirement alignment by more than 10 percent compared to state-of-the-art methods and achieves higher human preference ratings. After evolving through 200 case simulations over a period of ten hours, the framework achieves an optimal balance between resource efficiency and performance, demonstrating strong generalizability for scalable medical training.
Zhang et al. [157] present PsyDraw, a multimodal LLM-driven multi-agent system designed to support mental health professionals in analyzing House-Tree-Person (HTP) drawings for early screening of left-behind children (LBCs) in rural China. Recognizing the acute shortage of clinicians, PsyDraw employs specialized agents for detailed feature extraction and psychological interpretation in two stages: comprehensive analysis of drawing elements and automated generation of professional reports. Evaluated on 290 primary-school HTP submissions, PsyDraw achieved High Consistency with expert evaluations in 71.03% of cases and Moderate Consistency in 26.21%, flagging 31.03% of children as needing further attention. Deployed in pilot schools, PsyDraw demonstrates strong potential as a scalable, preliminary screening tool that maintains high professional standards and addresses critical mental health gaps in resource-limited settings.
Lee et al. [156] introduce PSYCHE, a comprehensive framework for benchmarking psychiatric assessment conversational agents (PACAs) built on large language models. Recognizing that psychiatric evaluations rely on nuanced, multi-turn interactions between clinicians and patients, PSYCHE simulates patients using a detailed psychiatric construct that specifies their profiles, histories, and behavioral patterns. This approach enables clinically relevant assessments, ensures ethical safety checks, facilitates cost-efficient deployment, and provides quantitative evaluation metrics. The framework was validated in a study involving ten board-certified psychiatrists who reviewed and rated the simulated interactions, demonstrating PSYCHE's ability to rigorously evaluate PACAs' clinical appropriateness and safety.
Xu et al. [155] address the limitations of existing LLM-based Cognitive Behavioral Therapy (CBT) systems, namely their rigid agent structures and tendency toward redundant, unhelpful suggestions, by proposing AutoCBT, a dynamic multi-agent framework for automated psychological counseling. Initially, the authors develop a general single-turn consultation agent using Quora-like and YiXinLi models, evaluated on a bilingual dataset to benchmark response quality in single-round interactions. Building on these insights, they introduce dynamic routing and supervisory mechanisms modeled after real-world counseling practices, enabling agents to self-optimize and tailor interventions more effectively. Experimental results demonstrate that AutoCBT generates higher-quality CBT-oriented responses compared to fixed-structure systems, highlighting its potential to deliver scalable, empathetic, and contextually appropriate psychological support for users who might otherwise avoid in-person therapy.
Yang et al. [154] present CAMI, an automated conversational counselor agent grounded in Motivational Interviewing (MI), a client-centered approach designed to resolve ambivalence and promote behavior change. CAMI's novel STAR framework integrates three LLM-powered modules, client State inference, motivation Topic exploration, and response gEneration, to evoke "change talk" in line with MI principles. By accurately inferring a client's emotional and motivational state, exploring relevant topics, and generating empathetic, directive responses, CAMI facilitates more effective counseling across diverse populations. The authors evaluate CAMI using both automated metrics and manual assessments with simulated clients, measuring MI skill competency, state inference accuracy, topic exploration proficiency, and overall counseling success. Results demonstrate that CAMI outperforms existing methods and exhibits counselor-like realism, while ablation studies highlight the essential contributions of the state inference and topic exploration modules to its superior performance.
Steenstra et al. [149] address the challenges in therapeutic counseling training by proposing an innovative LLM-powered system that provides continuous, detailed feedback during simulated patient interactions. Focusing on motivational interviewing, a counseling approach emphasizing empathy and collaborative behavior change, the framework features a simulated patient and visualizations of turn-by-turn performance to guide counselors through role-play scenarios. The system was evaluated with both professional and student counselors, who reported high usability and satisfaction, indicating that frequent and granular feedback can significantly enhance the learning process compared to traditional, intermittent methods.
Abbasi et al. [153] introduce HamRaz, the first Persian-language dataset tailored for Person-Centered Therapy (PCT) with large language models (LLMs), addressing a critical gap in culturally and linguistically appropriate mental health resources. Recognizing that existing counseling datasets are largely confined to Western and East Asian contexts, the authors design HamRaz by blending scripted therapeutic dialogues with adaptive LLM-driven role-playing to foster coherent, dynamic therapy sessions in Persian. To rigorously assess performance, they propose HamRazEval, a dual evaluation framework combining general dialogue quality metrics with the Barrett–Lennard Relationship Inventory (BLRI) to measure therapeutic rapport and effectiveness. Experimental comparisons demonstrate that LLMs trained on HamRaz generate more empathetic, contextually nuanced, and realistic counseling interactions than conventional Script Mode or Two-Agent Mode approaches.
c) General Medical Assistants, Clinical Workflow & Decision Making: Yun et al. [164] introduce an end-to-end framework for generating synthetic users to evaluate interactive agents aimed at promoting positive behavior change, focusing on sleep and diabetes management. The framework first generates structured data based on real-world health and lifestyle factors, demographics, and behavioral attributes. Next, it creates complete user profiles conditioned on this structured data.
Interactions between synthetic users and health coaching agents are simulated using generative agent models such as Concordia or by directly prompting a language model. Case studies with sleep and diabetes coaching agents demonstrate that the synthetic users enable realistic dialogue by accurately reflecting users' needs and challenges. Blinded evaluations by human experts confirm that these health-grounded synthetic users portray real human users more faithfully than generic synthetic users. This approach provides a scalable and realistic testing ground for developing and refining conversational agents in health and lifestyle coaching.
Chen et al. [163] address the complexity of clinical decision-making in inpatient pathways by introducing both a new benchmark and a multi-agent AI framework. The authors construct the Inpatient Pathway Decision Support (IPDS) benchmark from the MIMIC-IV database, comprising 51,274 cases across nine triage departments, 17 disease categories, and 16 standardized treatment options to capture the multifaceted nature of inpatient care. Building on this resource, they propose the Multi-Agent Inpatient Pathways (MAP) framework, which employs a triage agent for patient admission, a diagnosis agent for department-level decision-making, and a treatment agent for care planning, all coordinated by a chief agent that oversees the entire pathway. In extensive experiments, MAP achieves a 25.10% improvement in diagnostic accuracy over the state-of-the-art LLM HuatuoGPT2-13B and surpasses three board-certified clinicians in clinical compliance by 10–12%. These results demonstrate the potential of multi-agent systems to support complex inpatient workflows and lay the groundwork for future AI-driven decision support in hospital settings.
Fig. 8: Agent LLM Applications for Healthcare (clinical diagnosis, imaging & decision support; mental health, counseling & therapy agents; general medical assistants, clinical workflow & decision making; pharmaceutical & drug-related agents).
d) Pharmaceutical & Drug-Related Agents: Wang et al. [161] introduce PatentAgent, the first end-to-end intelligent agent designed to streamline pharmaceutical patent analysis by leveraging large language models. PatentAgent integrates three core modules: PA-QA for patent question answering, PA-Img2Mol for converting chemical structure images into molecular representations, and PA-CoreId for identifying core chemical scaffolds. PA-Img2Mol achieves accuracy gains of 2.46 to 8.37 percent across CLEF, JPO, UOB, and USPTO patent image benchmarks, while PA-CoreId delivers improvements of 7.15 to 7.62 percent on the PatentNetML scaffold identification task. By combining these modules within a unified framework, PatentAgent addresses the full spectrum of patent analysis needs, from extracting detailed experimental insights to pinpointing key molecular structures, and offers a powerful tool to accelerate research and innovation in drug discovery.
Averly et al. [160] introduce LIDDiA, an autonomous in silico agent designed to navigate the entire drug discovery pipeline by leveraging the reasoning capabilities of large language models. Unlike prior AI tools that address individual steps such as molecule generation or property prediction, LIDDiA orchestrates the end-to-end process from target selection through lead optimization. The authors evaluate LIDDiA on 30 clinically relevant targets and show that it generates candidate molecules satisfying key pharmaceutical criteria in over 70 percent of cases. Furthermore, LIDDiA demonstrates an intelligent balance between exploring novel chemical space and exploiting known scaffolds, and successfully identifies promising new inhibitors for the epidermal growth factor receptor (EGFR), a major oncology target.
Inoue et al. [162] present a multi-agent framework designed to accelerate drug repurposing by combining machine learning and knowledge integration. The system includes three specialized agents: an AI Agent that trains robust drug–target interaction (DTI) models, a Knowledge Graph Agent that extracts DTIs from databases such as DGIdb, DrugBank, CTD and STITCH, and a Search Agent that mines biomedical literature to validate computational predictions. By integrating outputs from these agents, the framework leverages diverse data sources to identify promising candidates for repurposing. Preliminary evaluations indicate that this approach not only enhances the accuracy of drug–disease interaction predictions compared to existing methods but also reduces the time and cost associated with traditional drug discovery. The interpretable results and scalable architecture demonstrate the potential of multi-agent systems to drive innovation and efficiency in biomedical research.
2) Materials Science: Materials science has recently benefited from the integration of LLM-based agents, which are helping to automate complex scientific workflows and enhance research efficiency. In this subsection, we highlight two notable developments, including the application of AI agents in astronomical observations to streamline data collection and analysis, and the creation of specialized agent systems tailored to address the unique challenges of materials science research.
a) LLM-Based Agents for Astronomical Observations: The StarWhisper Telescope System [132] leverages LLM-based agents to streamline the complex workflow of astronomical observations within the Nearby Galaxy Supernovae Survey (NGSS) project.
This innovative system automates critical tasks, including generating customized observation lists, initiating telescope observations, real-time image analysis, and formulating follow-up proposals, to reduce the operational burden on astronomers and lower training costs. By integrating these agents into the observation process, the system can efficiently verify and dispatch observation lists, analyze transient phenomena in near real-time, and seamlessly communicate results to observatory teams for subsequent scheduling.
b) Materials Science Research: HoneyComb [133] is introduced as the first LLM-based agent system tailored explicitly for materials science, addressing the unique challenges posed by complex computational tasks and outdated implicit knowledge that often lead to inaccuracies and hallucinations in general-purpose LLMs. The system leverages a novel, high-quality materials science knowledge base (MatSciKB) curated from reliable literature and a sophisticated tool hub (ToolHub) that employs an Inductive Tool Construction method to generate, decompose, and refine specialized API tools. Additionally, the retriever module adaptively selects the most relevant knowledge sources and tools for each task, ensuring high accuracy and contextual relevance.
3) Biomedical Science: The biomedical field has seen important progress through the development of LLM-based agents designed to support knowledge discovery, enhance reasoning capabilities, and evaluate scientific literature. In this subsection, we review recent contributions that focus on gene set analysis, iterative learning for improved reasoning, and the evaluation of AI scientist agents through specialized biomedical benchmarks.
a) Gene Set Knowledge Discovery: Gene set knowledge discovery is crucial for advancing human functional genomics, yet traditional LLM approaches often suffer from issues like hallucinations. To address this, Wang et al. [134] introduce GeneAgent, a pioneering language agent with self-verification capabilities that autonomously interacts with biological databases and leverages specialized domain knowledge to enhance accuracy. Benchmarking on 1,106 gene sets from diverse sources, GeneAgent consistently outperforms standard GPT-4, and a detailed manual review confirms that its self-verification module effectively minimizes hallucinations and produces more reliable analytical narratives. Moreover, when applied to seven novel gene sets derived from mouse B2905 melanoma cell lines, expert evaluations reveal that GeneAgent offers novel insights into gene functions, significantly expediting the process of knowledge discovery in functional genomics.
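A self-verification loop of this kind can be sketched as follows: draft claims about a gene set, check them against a biological database, and keep only supported statements. The database client and helper methods below are stand-ins for illustration, not the authors' implementation.

# GeneAgent-style self-verification sketch.
def analyze_gene_set(genes, llm, db_client, max_revisions=2):
    narrative = llm.generate(f"Summarize the shared function of genes: {', '.join(genes)}")
    claims = llm.extract_claims(narrative)
    verified = []
    for claim in claims:
        evidence = db_client.query(claim)              # e.g. an enrichment or annotation lookup
        for _ in range(max_revisions):
            if llm.supports(claim, evidence):
                verified.append(claim)
                break
            claim = llm.revise(claim, evidence)        # rewrite or drop unsupported claims
    return llm.compose_report(genes, verified)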
b) Reasoning with Recursive Learning: Buehler et al. [135] proposed a framework, named PRefLexOR, that fuses preference optimization with reinforcement learning concepts to enable language models to self-improve through iterative, multi-step reasoning. The approach employs a recursive learning strategy in which the model repeatedly revisits and refines intermediate reasoning steps before producing a final output, both during training and inference. Initially, the model aligns its reasoning with accurate decision paths by optimizing the log odds between preferred and non-preferred responses while constructing a dynamic knowledge graph through question generation and retrieval augmentation. In a subsequent stage, rejection sampling is employed to refine the reasoning quality by generating in-situ training data and masking intermediate steps, all within a thinking token framework that fosters iterative feedback loops.
c) Biomedical AI Scientist Agents: Lin et al. [165] introduce BioKGBench, a novel benchmark designed to evaluate biomedical AI scientist agents from the perspective of literature understanding. Unlike traditional evaluation methods that rely solely on direct QA or biomedical experiments, BioKGBench decomposes the critical ability of "understanding literature" into two atomic tasks: one that verifies scientific claims in unstructured text from research papers and another that involves interacting with structured knowledge-graph question-answering (KGQA) for literature grounding. Building on these components, the authors propose a new agent task called KGCheck, which uses domain-based retrieval-augmented generation to identify factual errors in large-scale knowledge graph databases. With a dataset of over 2,000 examples for the atomic tasks and 225 high-quality annotated samples for the agent task, the study reveals that state-of-the-art agents, in both everyday and biomedical settings, perform poorly or suboptimally on this benchmark.
4) Research Applications: LLM-based agents are increasingly being developed to support and automate various aspects of the scientific research process. This subsection presents a selection of recent applications, including collaborative research environments, automated survey generation, structured literature analysis for ideation, workflow management in data science, and AI-driven hypothesis generation.
a) Collaborative Research Among LLM Agents: Schmidgall and Moor [166] introduce AgentRxiv, a framework designed to enable collaborative research among autonomous LLM agent laboratories by leveraging a shared preprint server. Recognizing that scientific discovery is inherently incremental and collaborative, AgentRxiv allows agents to upload and retrieve research reports, thereby sharing insights and building upon previous work in an iterative manner. The study demonstrates that agents with access to prior research achieve a significant performance boost, an 11.4% relative improvement on the MATH-500 dataset, compared to those operating in isolation. Furthermore, the best-performing collaborative strategy generalizes to other domains with an average improvement of 3.3%, and when multiple agent laboratories share their findings, overall accuracy increases by 13.7% relative to the baseline. These findings highlight the potential of autonomous agents to collaborate with humans, paving the way for more efficient and accelerated scientific discovery.
b) Automated Survey Generation: Liang et al. [136] developed the SurveyX platform, which leverages the exceptional comprehension and knowledge capabilities of LLMs to overcome critical limitations in automated survey generation, including finite context windows, superficial content discussions, and the lack of systematic evaluation frameworks. Inspired by human writing processes, SurveyX decomposes the survey composition process into two distinct phases: Preparation and Generation. During the preparation phase, the system incorporates online reference retrieval and applies a novel preprocessing method, AttributeTree, to effectively structure the survey's content.
System | Year | Domain | Objective | Approach | Key results | Eval. Framework | Collab. Platform | Open Sci.
AgentRxiv [166] | 2025 | Collaborative Research | Share and build upon preprints across autonomous LLM labs. | Upload/retrieve via shared preprint server with iterative updates. | +11.4% on MATH-500; +3.3% cross-domain; +13.7% multi-lab. | MATH-500 benchmark | AgentRxiv server | Preprint sharing
SurveyX [136] | 2025 | Survey Generation | Automate systematic literature surveys with high quality. | Preparation (retrieval + AttributeTree) + Generation (repolishing). | +0.259 content quality; +1.76 citation precision vs. baselines. | Content & citation scoring | Bibliographic APIs | Structured citations
CoI Agent [137] | 2024 | Research Ideation | Structure literature into progressive idea chains. | Sequential Chain-of-Ideas + Idea Arena evaluation protocol. | Expert-comparable idea quality at $0.50 per idea. | Idea Arena | CoI framework | Cost-efficient ideation
Data Interpreter [167] | 2024 | Data Science Workflows | Manage end-to-end, dynamic DS pipelines robustly. | Hierarchical Graph Modeling + Programmable Node Generation. | +25% on InfiAgent-DABench (75.9→94.9%); ML & MATH gains. | InfiAgent-DABench | Pipeline APIs | Reproducible workflows
AI Co-Scientist [168] | 2025 | Scientific Discovery | Generate and refine research hypotheses autonomously. | Seven specialized agents with Elo tournaments and meta-review. | +300 Elo hypothesis quality; +27% novelty scores. | Elo & novelty scoring | Multi-agent pipeline | Hypothesis publication
Eval. Framework: Evaluation Framework; Collab. Platform: Collaboration Platform; Open Sci.: Open Science Support.
In the subsequent Generation phase, a repolishing process refines the output to enhance the depth and accuracy of the generated study, particularly improving content quality and citation precision. Experimental evaluations reveal that SurveyX achieves a content quality improvement of 0.259 and a citation quality enhancement of 1.76 over existing systems, bringing its performance close to that of human experts across multiple evaluation dimensions.
c) Structuring Literature for Research Ideation: Li et al. [137] introduce the Chain-of-Ideas (CoI) agent, a novel LLM-based framework for automating research ideation by structuring relevant literature into a chain that mirrors the progressive development within a research domain. The CoI agent addresses the challenge posed by the exponential growth of scientific literature, which overwhelms traditional idea-generation methods that rely on simple prompts or expose models to raw, unfiltered text. By organizing information in a sequential chain, the CoI agent enables LLMs to capture current advancements more effectively, enhancing their ability to generate innovative research ideas. Complementing this framework is the Idea Arena, an evaluation protocol that assesses the quality of generated ideas from multiple perspectives, aligning closely with the preferences of human researchers. Experimental results indicate that the CoI agent outperforms existing methods and achieves quality comparable to human experts, all while maintaining a low cost of approximately $0.50 per candidate idea and corresponding experimental design.
d) Managing Data Science Workflows: Hong et al. [167] propose Data Interpreter, an LLM-based agent that tackles end-to-end data science workflows by addressing challenges in solving long-term, interconnected tasks and adapting to dynamic data environments. Unlike previous methods that focus on individual tasks, Data Interpreter leverages two key modules: Hierarchical Graph Modeling, which decomposes complex problems into manageable subproblems through dynamic node generation and graph optimization, and Programmable Node Generation, which iteratively refines and verifies each subproblem to boost the robustness of code generation. Extensive experiments demonstrate significant performance gains, achieving up to a 25% boost on InfiAgent-DABench (increasing accuracy from 75.9% to 94.9%), as well as improvements on machine learning, open-ended tasks, and the MATH dataset, highlighting its superior capability in managing evolving task dependencies and real-time data adjustments.
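The decomposition-and-execution pattern behind such hierarchical graph modeling can be sketched as follows: a goal is expanded into a dependency graph of subtasks, executed in topological order, with failing nodes regenerated. The planner, coder, and executor interfaces are illustrative only, not the Data Interpreter codebase.

# Sketch of a hierarchical task graph with programmable node regeneration.
import networkx as nx

def build_task_graph(goal, planner):
    graph = nx.DiGraph()
    for task in planner.decompose(goal):              # e.g. load -> clean -> model -> report
        graph.add_node(task.name, spec=task)
        for dep in task.depends_on:
            graph.add_edge(dep, task.name)
    return graph

def run_pipeline(goal, planner, coder, executor, max_retries=2):
    graph = build_task_graph(goal, planner)
    results = {}
    for name in nx.topological_sort(graph):
        spec = graph.nodes[name]["spec"]
        for _ in range(max_retries + 1):
            code = coder.generate(spec, upstream=results)     # programmable node generation
            outcome = executor.run(code)
            if outcome.ok:
                results[name] = outcome
                break
            spec = planner.refine(spec, outcome.error)        # adapt the node and retry
    return results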
e) Automating Scientific Discovery: Google [168] introduced the AI co-scientist, a multi-agent system built on Google DeepMind Gemini 2.0, designed to automate scientific discovery by generating and refining novel research hypotheses. The framework comprises seven specialized agents, namely Supervisor, Generation, Reflection, Ranking, Evolution, Proximity, and Meta-review, that collaboratively manage tasks ranging from parsing research goals to conducting simulated debates and organizing hypotheses. For example, the system employs a Ranking Agent that uses pairwise Elo tournaments, boosting hypothesis quality by over 300 Elo points. At the same time, the Meta-review Agent's feedback has been shown to increase hypothesis novelty scores by 27%. In practical applications, such as drug repurposing for acute myeloid leukemia and novel target discovery for liver fibrosis, the framework demonstrates significant performance improvements, paving the way for AI systems that can generate and iteratively refine scientific hypotheses with expert-level precision.
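For reference, the Elo mechanism underlying such pairwise ranking tournaments is a simple rating update after each head-to-head comparison; the K-factor and scale below are standard chess-style defaults, not values reported by the paper.

# Minimal Elo update for pairwise hypothesis comparisons.
def elo_update(rating_a: float, rating_b: float, a_wins: bool, k: float = 32.0):
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    score_a = 1.0 if a_wins else 0.0
    rating_a += k * (score_a - expected_a)
    rating_b += k * ((1.0 - score_a) - (1.0 - expected_a))
    return rating_a, rating_b

# Example: hypothesis A (1200) beats hypothesis B (1250) in a simulated debate.
print(elo_update(1200.0, 1250.0, a_wins=True))   # -> roughly (1218.3, 1231.7)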
5) Software Engineering: Software engineering has become a significant area of application for LLM-based agents, with innovations spanning architecture design and verification systems, adaptive control, software analytics, and multi-agent collaboration. This subsection presents recent developments across a wide range of tasks, including agent programming frameworks, tutoring systems, automated environment configuration, usability testing, and multilingual code generation. Fig. 9 presents a classification of Agent LLM Applications for Software Engineering.
System | Year | Domain | Objective | Approach | Key results
Ann Arbor Architecture [169] | 2025 | Agent Programming Arch. | Treat LLMs as automata, enabling programming via formal and natural languages. | Introduces the Ann Arbor conceptual framework and Postline platform. | Early experiments show improved in-context learning.
AgentGym [170] | 2025 | Verification & Supervision | Scalable training of SWE-agents via SYNGEN data curation and Hybrid Test-time Scaling. | Leverages SYNGEN synthetic data and Hybrid Test-time Scaling on SWE-Gym; trained on SWE-Bench Verified. | Achieves 51% pass rate on SWE-Bench Verified.
TRAVER & DICT [171] | 2025 | Intelligent Tutoring | Trace-and-Verify workflow for stepwise coding guidance; DICT evaluation protocol. | Combines knowledge tracing with turn-by-turn verification; evaluated via DICT protocol. | Significant improvements in coding-tutoring success rates.
CURA [172] | 2025 | Code Reasoning | Verbal Process Supervision for code understanding and reasoning. | Integrates VPS modules with LLM to guide reasoning over code. | +3.65% on BigCodeBench with o3-mini.
DARS [173] | 2025 | Performance Enhancement | Dynamic Action Re-Sampling to branch inference at decision points. | Branches on execution feedback to explore alternative actions. | 55% pass@k and 47% pass@1 on SWE-Bench Lite (Claude 3.5 Sonnet V2).
LocAgent [174] | 2025 | Code Localization | Graph-based code representation for multi-hop localization. | Parses code into heterogeneous graphs for reasoning over dependencies. | 92.7% file-level accuracy; +12% GitHub issue resolution.
GateLens [175] | 2025 | Release Validation | NL→Relational-Algebra conversion and Python code generation for test-data analysis. | Automates query translation and optimized code for data processing. | 80% reduction in analysis time (automotive software).
Repo2Run [176] | 2025 | Env. Configuration | Atomic Docker setup synthesis with dual-environment rollback. | Synthesizes and tests Dockerfiles; isolates failures via dual environments. | 86.0% success on 420 Python repos; +63.9% vs. baselines.
UXAgent [177] | 2025 | Usability Testing | LLM-agent with browser connector to simulate thousands of users. | Generates qualitative insights, action logs, and recordings before user studies. | Accelerates UX iteration and reduces upfront user recruitment.
SWE-Gym [178] | 2024 | Training Environment | Realistic Python tasks and unit tests for SWE-agent training. | Provides executable environments with tests and natural language descriptions. | +19% resolve rate; 32.0% on SWE-Bench Verified; 26.0% on Lite.
Qwen2.5-xCoder [179] | 2025 | Multi-Agent Collaboration | Multilingual instruction tuning via language-specific agents with memory. | Agents collaborate to generate and refine multilingual instructions. | Outperforms on multilingual programming benchmarks.
SyncMind [180] | 2025 | Collaboration Simulation | Defines and benchmarks out-of-sync scenarios to improve agent coordination. | Introduces SyncBench with 24k real-world instances. | Exposes performance gaps and guides improvements.
CodeSim [181] | 2025 | Code Generation | Plan verification and I/O simulation for multi-agent synthesis & debugging. | Incorporates plan verification and internal debugging via input/output simulation. | SOTA on HumanEval, MBPP, APPS, CodeContests.
Bench.: Benchmarking; Intgr.: Integration & Deployment; Std.: Standards Compliance.
a) Agent Programming Architectures: Dong et al. [169] explore prompt engineering for large language models (LLMs) from the perspective of automata theory, arguing that LLMs can be viewed as automata. They assert that just as automata must be programmed using the languages they accept, LLMs should similarly be programmed within the scope of both natural and formal languages. This insight challenges traditional software engineering practices, which often distinguish between programming and natural languages. The paper introduces the Ann Arbor Architecture, a conceptual framework designed for agent-oriented programming of language models, which serves as a higher-level abstraction to enhance in-context learning beyond basic token generation. The authors also present Postline, their agent platform, and discuss early results from experiments conducted to train agents within this framework.
b) Verification & Supervision Agents: The papers by Jain et al. [170], Wang et al. [171], and Chen et al. [172] contribute to advancing the use of large language models (LLMs) for real-world software engineering (SWE) tasks, intelligent tutoring, and code generation. Jain et al. [170] introduce AgentGym, a comprehensive environment for training SWE-agents, addressing challenges in scalable curation of executable environments and test-time compute scaling. Their approach leverages SYNGEN, a synthetic data curation method, and Hybrid Test-time Scaling to improve performance on the SWE-Bench Verified benchmark, achieving a state-of-the-art pass rate of 51%. Wang et al. [171] propose a novel coding tutoring framework, Trace-and-Verify (TRAVER), combining knowledge tracing and turn-by-turn verification to enhance tutor agents' guidance toward task completion. Their work introduces DICT, a holistic evaluation protocol for tutoring agents, demonstrating significant improvements in coding tutoring success rates.
Fig. 9: Agent LLM Applications in Software Engineering (agent programming architectures; verification & supervision agents; adaptive control & performance enhancement; code localization & software analytics; domain-specific SWE agents; multi-agent collaboration & simulation).
Finally, Chen et al. present CURA, a code understanding and reasoning system augmented with verbal process supervision (VPS). CURA achieves a 3.65% improvement on benchmarks like BigCodeBench and demonstrates enhanced performance when paired with the o3-mini model. These works collectively push the boundaries of LLM applications in complex software engineering tasks, intelligent tutoring, and reasoning-driven code generation.
c) Adaptive Control & Performance Enhancement: Aggarwal et al. [173] introduce Dynamic Action Re-Sampling (DARS), a novel approach for scaling compute during inference in coding agents, aimed at improving their decision-making capabilities. While existing methods often rely on linear trajectories or random sampling, DARS enhances agent performance by branching out at key decision points and selecting alternative actions based on the history of previous attempts and execution feedback. This enables coding agents to recover more effectively from sub-optimal decisions, leading to faster and more efficient problem-solving. The authors evaluate DARS on the SWE-Bench Lite benchmark, achieving an impressive pass@k score of 55% with Claude 3.5 Sonnet V2 and a pass@1 rate of 47%, surpassing current state-of-the-art open-source frameworks. This approach provides a significant advancement in optimizing coding agent performance, reducing the need for extensive manual intervention and improving overall efficiency.
d) Code Localization & Software Analytics: The works by Chen et al. [174] and Gholamzadeh et al. [175] contribute significant advancements in the application of Large Language Models (LLMs) to improve software engineering tasks, such as code localization and release validation. Chen et al. [174] introduce LocAgent, a framework for code localization that utilizes graph-based representations of codebases. By parsing code into directed heterogeneous graphs, LocAgent captures the relationships between various code structures and their dependencies, enabling more efficient and accurate localization through multi-hop reasoning. Their approach, when applied to real-world benchmarks, demonstrates substantial improvements in localization accuracy, achieving up to 92.7% on file-level localization and enhancing GitHub issue resolution success rates by 12%. In comparison to state-of-the-art models, LocAgent provides similar performance at a significantly lower cost. On the other hand, Gholamzadeh et al. [175] present GateLens, an LLM-based tool designed to improve release validation in safety-critical systems like automotive software. GateLens automates the analysis of test data by converting natural language queries into Relational Algebra expressions and generating optimized Python code, which significantly accelerates data processing. In industrial evaluations, GateLens reduced analysis time by over 80%, demonstrating strong robustness and generalization across different query types. This tool improves decision-making in safety-critical environments by automating test result analysis, thereby enhancing the scalability and reliability of software systems in automotive applications.
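A GateLens-style flow, natural language to a relational-algebra-like plan to executable analysis code, can be sketched as below; the prompts and helper names are assumptions for illustration, not the GateLens implementation, and generated code should only ever be executed in a sandbox.

# Illustrative natural-language-to-query pipeline over a test-results table.
import pandas as pd

def nl_to_plan(question: str, llm) -> str:
    # e.g. "How many braking tests failed on build 42?" ->
    # "COUNT(SELECT(tests, build == 42 AND suite == 'braking' AND result == 'FAIL'))"
    return llm.complete(f"Translate to relational algebra over table `tests`: {question}")

def plan_to_pandas(plan: str, llm) -> str:
    return llm.complete(f"Emit a single pandas expression over DataFrame `tests` for: {plan}")

def answer(question: str, tests: pd.DataFrame, llm):
    code = plan_to_pandas(nl_to_plan(question, llm), llm)
    # eval() on generated code is acceptable only inside a reviewed, sandboxed environment.
    return eval(code, {"tests": tests, "pd": pd})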
e) Domain-Specific SWE Agents: Hu et al. [176] introduce Repo2Run, a novel LLM-based agent aimed at automating the environment configuration process in software development. Traditional methods for setting up environments often involve manual work or rely on fragile scripts, which can lead to inefficiencies and errors. Repo2Run addresses these challenges by fully automating the configuration of Docker containers for Python repositories. The key innovations of Repo2Run are its atomic configuration synthesis and a dual-environment architecture, which isolates internal and external environments to prevent contamination from failed commands. A rollback mechanism ensures that only fully executed configurations are applied, and the agent generates executable Dockerfiles from successful configurations. Evaluated on a benchmark of 420 Python repositories with unit tests, Repo2Run achieved an impressive success rate of 86.0%, outperforming existing baselines by 63.9%.
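The configure-then-validate loop behind this style of environment synthesis can be sketched as follows: candidate setup commands are tried in a disposable container, failed steps are rolled back, and only the commands that actually succeed are emitted into a Dockerfile. Helper names are illustrative; the real system drives Docker through an LLM agent rather than a fixed script.

# Sketch of atomic setup synthesis with rollback and Dockerfile emission.
import subprocess

def try_in_container(image: str, commands) -> bool:
    """Run candidate setup commands in a throwaway container (the 'internal' environment)."""
    script = " && ".join(commands) if commands else "true"
    proc = subprocess.run(
        ["docker", "run", "--rm", image, "bash", "-lc", script],
        capture_output=True, text=True,
    )
    return proc.returncode == 0

def synthesize_dockerfile(candidate_steps, base_image="python:3.11-slim") -> str:
    accepted = []
    for step in candidate_steps:                 # e.g. "pip install -r requirements.txt"
        if try_in_container(base_image, accepted + [step]):
            accepted.append(step)                # keep atomic steps that succeed
        # failed steps are dropped (rolled back) instead of contaminating the image
    lines = [f"FROM {base_image}", "WORKDIR /app"] + [f"RUN {cmd}" for cmd in accepted]
    return "\n".join(lines)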
Lu et al. [177] developed UXAgent, a tool that uses LLM-Agent technology and a universal browser connector to simulate thousands of users for automated usability testing. It enables user experience (UX) researchers to quickly iterate on study designs by providing qualitative insights, quantitative action data, and video recordings before engaging participants.
Wang et al. [171] introduce TRAVER (Trace-and-Verify), a novel agent workflow that combines knowledge tracing, which estimates a student's evolving knowledge state, with turn-by-turn verification to ensure effective step-by-step guidance toward task completion. Alongside TRAVER, they propose DICT, an automatic evaluation protocol that utilizes controlled student simulation and code generation tests to assess the performance of tutoring agents holistically.
SWE-Gym [178] is introduced as the first dedicated environment for training real-world software engineering (SWE) agents, designed around 2,438 Python task instances that include complete codebases, executable runtime environments, unit tests, and natural language task descriptions. This realistic setup allows for training language model–based SWE agents that significantly improve performance, achieving up to 19% absolute gains in resolve rate on popular test sets like SWE-Bench Verified and Lite. Furthermore, the authors explore inference-time scaling by employing verifiers trained on agent trajectories sampled from SWE-Gym, which, when combined with their fine-tuned agents, achieve state-of-the-art performance of 32.0% on SWE-Bench Verified.
The works by Yang et al. [179], Guo et al. [180], and Islam et al. [181] contribute significant advancements to the application of Large Language Models (LLMs) in code understanding, collaborative software engineering, and code generation. Guo et al. [180] introduce SyncMind, a framework that defines the out-of-sync problem in collaborative software engineering. Through their SyncBench benchmark, which includes over 24,000 instances of out-of-sync scenarios from real-world codebases, they highlight performance gaps in current agents.
over 100 subcategories, and iterative instruction refinement via suggester-editor pairs. This process yields a dataset of 25 million prompt-response pairs covering diverse skills such as text editing, coding, creative writing, and reading comprehension. When applied to fine-tune a Mistral-7B model, the resulting Orca-3 model demonstrated significant performance improvements ranging from 19% to 54% across benchmarks like MMLU, AGIEval, GSM8K, BBH, and AlpacaEval, as well as a notable reduction in hallucinations for summarization tasks. These findings underscore the potential of automated, agentic synthetic data generation to enhance model capabilities while reducing reliance on labor-intensive data curation, positioning AgentInstruct as a promising tool for advancing LLM instruction tuning.
Figure: Agent LLM Applications in Finance (agentic financial modeling & risk management; multi-agent financial decision-making; citation-enhanced financial QA; stock analysis & evaluation), covering TwinMarket [183], FinCon [184], Multi-Agent Financial QA [186], FinSphere [188], Agentic Crews [189], MarketSenseAI [190], and CSA [191].
been shown to further increase accuracy, albeit with higher agent approach significantly boosts performance, with an av-
operational costs. erage increase of 15% for the LLaMA3-8B model and 5% for
b) Market Simulation: Yang et al. [183] introduce Twin- the LLaMA3-70B model, compared to single-agent systems.
Market, a multi-agent framework that harnesses large language Moreover, the proposed system performs comparably to and
models (LLMs) to simulate complex socio-economic systems, sometimes exceeds the capabilities of much larger single-agent
addressing longstanding challenges in modeling human behav- models such as LLaMA3.1-405B and GPT-4o-mini, although
ior. Traditional rule-based agent-based models often fall short it slightly lags behind Claude-3.5 Sonnet.
in capturing the irrational and emotionally driven aspects of f) Stock Analysis and Evaluation: Han et al. [187]
decision-making emphasized in behavioral economics. Twin- present a novel multi-agent collaboration system designed to
Market leverages the cognitive biases and dynamic emotional enhance financial analysis and investment decision-making by
responses inherent in LLMs to create more realistic simula- leveraging the collaborative potential of multiple AI agents.
tions of socio-economic interactions. The study illustrates how Moving beyond traditional single-agent models, the system
individual agent behaviors can lead to emergent phenomena features configurable agent groups with diverse collaboration
such as financial bubbles and recessions when combined structures that dynamically adapt to varying market conditions
through feedback mechanisms through experiments conducted and investment scenarios through a sub-optimal combination
in a simulated stock market environment. strategy. The study focuses on three key sub-tasks funda-
c) Sequential Investment Decision-Making: Yu et al. mentals, market sentiment, and risk analysis applied to the
[184] propose FinCon, an LLM-based multi-agent framework 2023 SEC 10-K forms of 30 companies from the Dow Jones
designed to tackle the complexities of sequential financial Index. Experimental findings reveal significant performance
investment decision-making. Recognizing that effective in- improvements with multi-agent configurations compared to
vestment requires dynamic interaction with volatile environ- single-agent approaches, demonstrating enhanced accuracy,
ments, FinCon draws inspiration from real-world investment efficiency, and adaptability.
firm structures by establishing a manager-analyst communi- In a related study, Han et al. [188] introduce FinSphere,
cation hierarchy. This design facilitates synchronized, cross- a conversational stock analysis agent designed to overcome
functional collaboration through natural language interactions two major challenges faced by current financial LLMs: their
while endowing each agent with enhanced memory capacity. A insufficient depth in stock analysis and the lack of objec-
key component is the risk-control module, which periodically tive metrics for evaluating the quality of analysis reports.
triggers a self-critiquing mechanism to update systematic The authors make three significant contributions. First, they
investment beliefs, thereby reinforcing future agent behavior present Stocksis, a dataset curated by industry experts to
and reducing unnecessary communication overhead. FinCon enhance the stock analysis capabilities of LLMs. Second,
exhibits strong generalization across various financial tasks, they propose Analyscore, a systematic evaluation framework
such as stock trading and portfolio management, and offers a that objectively assesses the quality of stock analysis reports.
promising approach to synthesizing multi-source information Third, they develop FinSphere, an AI agent that leverages
for optimized decision-making in dynamic financial markets. real-time data feeds, quantitative tools, and an instruction-
d) Strategic Behavior in Competitive Markets: Li et al. tuned LLM to generate high-quality stock analysis in response
[185] investigate the strategic behavior of large language to user queries. Experimental results indicate that FinSphere
models (LLMs) when deployed as autonomous agents in outperforms general and domain-specific LLMs and existing
multi-commodity markets within the framework of Cournot agent-based systems, even when these systems are enhanced
competition. The authors examine whether these models can with real-time data and few-shot guidance.
independently engage in anti-competitive practices, such as Fatouros et al. [189] introduce MarketSenseAI, an inno-
collusion or market division, without explicit human interven- vative framework for comprehensive stock analysis that har-
tion. Their findings reveal that LLMs can monopolize specific nesses large language models (LLMs) to integrate diverse
commodities by dynamically adjusting pricing and resource financial data sources ranging from financial news, historical
allocation strategies, thereby maximizing profitability through prices, and company fundamentals to macroeconomic indica-
self-directed strategic decisions. These results present signif- tors. Leveraging a novel architecture that combines Retrieval-
icant challenges and potential opportunities for businesses Augmented Generation with LLM agents, MarketSenseAI
incorporating AI into strategic roles and regulatory bodies processes SEC filings, earnings calls, and institutional reports
responsible for maintaining fair market competition. to enhance macroeconomic analysis. The latest advancements
e) Financial Reasoning and QA: Fatemi et al. [186] in the framework yield significant improvements in funda-
address the limitations of large language models (LLMs) in mental analysis accuracy over its previous iteration. Empirical
financial question-answering (QA) tasks that require complex evaluations on S&P 100 stocks (2023–2024) reveal cumulative
numerical reasoning. Recognizing that multi-step reasoning returns of 125.9% versus the index’s 73.5%, while tests on
is essential for extracting and processing information from S&P 500 stocks in 2024 show a 33.8% higher Sortino ratio,
tables and text, the authors propose a multi-agent framework underscoring the scalability and robustness of this LLM-driven
incorporating a critical agent to evaluate the reasoning process investment strategy.
and final answers. The framework is further enhanced with g) Agentic Financial Modeling and Risk Management:
multiple critic agents specializing in distinct aspects of the Okpala et al. [190] examine integrating large language models
answer evaluation. Experimental results show that this multi- into agentic systems within the financial services industry,
30
focusing on automating complex modeling and model risk into manageable sub-tasks and compiling them into a struc-
management (MRM) tasks. The authors introduce the concept tured memory library that can be referenced and refined in
of agentic crews, where teams of specialized agents, coordi- future queries. The framework incorporates three types of
nated by a manager, collaboratively execute distinct functions. memory and a library-enhanced reasoning component, en-
The modeling crew handles tasks such as exploratory data abling the system to improve over time through experience.
analysis, feature engineering, model selection, hyperparameter Evaluations on four SciBench chemical reasoning datasets
tuning, training, evaluation, and documentation, while the reveal that ChemAgent achieves performance gains of up to
MRM crew focuses on compliance checks, model replication, 46% with GPT-4, significantly outperforming existing methods
conceptual validation, outcome analysis, and documentation. and suggesting promising applications in fields such as drug
The effectiveness and robustness of these agentic workflows discovery and materials science.
are demonstrated through numerical examples applied to b) Materials Discovery & Design: By collaborating with
datasets in credit card fraud detection, credit card approval, materials science experts, Kumbhar et al. [193] curate a novel
and portfolio credit risk modeling, highlighting the potential dataset from recent journal publications that encapsulate real-
for autonomous decision-making in financial applications. world design goals, constraints, and methodologies. Using
h) Trustworthy Conversational Shopping Agents: Zeng this dataset, they test LLM-based agents to generate viable
et al. [191] focuses on enhancing the trustworthiness of LLM- hypotheses to achieve specified objectives under given con-
based Conversational Shopping Agents (CSAs) by addressing straints. To rigorously assess the relevance and quality of these
two key challenges: the generation of hallucinated or unsup- hypotheses, a novel scalable evaluation metric is proposed
ported claims and the lack of knowledge source attribution. To that mirrors the critical assessment process of materials scien-
combat these issues, the authors propose a production-ready tists. Together, the curated dataset, the hypothesis generation
solution that integrates a ”citation experience” through In- method, and the evaluation framework provide a promising
context Learning (ICL) and Multi-UX-Inference (MUI). This foundation for future research to accelerate materials discovery
approach enables CSAs to include citation marks linked to and design using LLM. ChemAgent is a novel framework
relevant product information without disrupting user experi- that aims to enhance chemical reasoning by leveraging large
ence features. Additionally, the work introduces automated language models through a dynamic, self-updating library.
metrics and scalable benchmarks to evaluate the grounding and
attribution capabilities of LLM responses holistically. Exper- Agent Trading
imental results on real-world data indicate that incorporating Arena
[202]
MACM [194] 2024 Advanced Solve multi-step math Multi-Agent MATH level 5
Reasoning problems with robust Conditional Mining accuracy increase from
generalization. prompting for iterative 54.68% to 76.73% on
refinement. GPT-4 Turbo.
MathLearner 2024 Inductive Enhance LLM Retrieval module plus +20.96% global
[195] Reasoning reasoning via procedural knowledge accuracy; solves
inductive retrieval and injection in inductive 17.54% previously
application. loop. unsolved problems.
Prompt Sampling 2024 Search Combine diverse Uniform sampling 43% fewer runs for
[196] Space prompting methods to over multiple prompt MATH-hard with
Expansion expand search space strategies; fewer maximal coverage.
efficiently. inference runs.
KG-Proof Agent 2025 Proof Con- Automate Integrates concept KG 34% success on
[198] struction formalization of with LLM to structure MUSTARDSAUCE;
proofs using lemmas and steps. 2–11% improvement
knowledge graphs. over baselines.
MATHVC [200] 2024 Educational Simulate group Virtual classroom with Realistic dialog;
Modeling discussions for diverse student-agents improves modeling
mathematical and meta planning. task performance.
modeling skills.
PACE [201] 2025 Personalized Tailor math instruction Felder-Silverman Higher engagement
Tutoring to learning styles with personas plus Socratic and outcomes versus
Socratic feedback. method and tailored traditional tutors.
data.
Agent Trading 2025 Numerical Improve numeric Virtual stock game Enhanced geometric
Arena [202] Reasoning inference with visual plus analysis over reasoning; validated
data and reflection. plots and charts. on NASDAQ dataset.
Proof Val.: Proof Validation; Solver Integr.: Solver & Assistant Integration; Notation Sup.: Notation & Formalism Support: : Partial; : Not Supported; : Supported.
collaborative agent systems, theorem proving, and knowledge ing information and applying prior knowledge to new tasks,
integration. Fig. 11 presents a classification of agent LLM the framework significantly outperforms traditional chain-of-
applications for solving mathematical problems. thought approaches. Specifically, it improves global accuracy
a) Mathematical Reasoning and Problem Solving: The by 20.96% and can solve 17.54% of mathematical problems
paper by Lei et al. [194] tackles the challenge of ad- that the baseline fails to address. A key framework component
vanced mathematical problem-solving in large language mod- is its efficient retrieval method, which enables the model to
els (LLMs), where performance significantly declines despite effectively incorporate external knowledge and support math-
recent advancements like GPT-4. While methods such as Tree ematical computations based on explicit written procedures.
of Thought and Graph of Thought have been explored to Lee et al. [196] investigate the limitations of traditional
enhance logical reasoning, they face notable limitations: their single prompting methods in large language models (LLMs)
effectiveness on complex problems is limited, and the need for mathematical reasoning and explore alternative prompting
for custom prompts for each problem restricts generalizability. strategies. It experimentally demonstrates that distinct prompt-
In response, the authors introduce the Multi-Agent System ing methods each probe unique search spaces, a differentiation
for Conditional Mining (MACM) prompting method. MACM that becomes more pronounced with increased problem com-
successfully addresses intricate, multi-step mathematical chal- plexity. To capitalize on this diversity, the study introduces
lenges and exhibits robust generalization across diverse mathe- an efficient sampling process that uniformly combines outputs
matical contexts. Notably, using MACM, the accuracy of GPT- from these varied methods, thereby expanding the overall
4 Turbo on level five problems in the MATH dataset improves search space and achieving improved performance with fewer
markedly from 54.68% to 76.73%, demonstrating its potential inference runs. Notably, for the particularly challenging prob-
to elevate LLM inferential capabilities substantially. lems in the MATH-hard subset, the approach reached maximal
Xie et al. [195] present an agent framework designed to search space utilization with approximately 43% fewer runs
enhance the mathematical reasoning abilities of large lan- compared to individual methods.
guage models (LLMs) through inductive reasoning. Drawing Deng et al. [197] introduce a novel approach to enhance
inspiration from the human learning process of generaliz- the generation of detailed and accurate reasoning traces in
32
large language models (LLMs), particularly for mathemati- to individual learner characteristics. PACE leverages the Felder
cal reasoning tasks. The authors propose an online learning and Silverman learning style model to simulate distinct student
framework termed ”Flows,” where component LLMs work personas, enabling the system to tailor teaching strategies
collaboratively and iteratively, engaging in incremental output to diverse learning styles a crucial factor for enhancing en-
production to build coherent solutions. Central to the approach gagement and comprehension in mathematics. Integrating the
is online Direct Preference Optimization (DPO) with rollouts, Socratic teaching method, PACE provides instant, reflective
which generates DPO pairs for each training example and feedback that encourages deeper cognitive processing and
updates the models in real-time. By directly comparing the critical thinking. The framework also involves constructing
quality of reasoning traces produced by this method against personalized teaching datasets and training specialized mod-
those generated by standard direct model inference, the study els, which facilitate identifying and adapting each student’s
demonstrates that the proposed Flow framework significantly unique needs. Extensive evaluations using multi-aspect criteria
improves LLM performance in mathematical reasoning. demonstrate that PACE outperforms traditional methods in
Li et al. [198] introduce a novel framework that augments personalizing the educational experience and boosting student
large language models (LLMs) with knowledge graphs to motivation and learning outcomes.
improve the construction and formalization of mathematical c) Numerical Reasoning: Ma et al. [202] investigate
proofs. The proposed approach tackles persistent challenges the limitations of large language models (LLMs) in handling
in automating the identification of key mathematical concepts, dynamic and unseen numerical reasoning tasks, mainly when
understanding their relationships, and embedding them within operating on plain-text data. To address this, the authors
rigorous logical frameworks. Experimental results show sig- introduce the Agent Trading Arena a virtual numerical game
nificant performance gains, with the framework achieving up simulating complex economic systems via zero-sum stock
to a 34% success rate on the MUSTARDSAUCE dataset on portfolio investments which better reflects real-world scenarios
o1-mini and consistently outperforming baseline models by where optimal solutions are not clearly defined. Experimental
2–11% across various benchmarks. results indicate that LLMs, including GPT-4o, face challenges
Wang et al. [199] introduce MA-LoT, a novel multi-agent with algebraic reasoning in textual formats, often focusing on
framework designed for the Lean4 theorem proving that it syn- local details at the expense of broader trends. In contrast,
ergizes high-level natural language reasoning with formal lan- when LLMs are provided with visual data representations,
guage verification feedback. Unlike traditional single-agent ap- such as scatter plots or K-line charts, they exhibit significantly
proaches that either generate complete proofs or perform tree enhanced geometric reasoning capabilities. This improvement
searches, MA-LoT leverages structured interactions among is further enhanced by incorporating a reflection module that
multiple agents to maintain long-term coherence and deeper facilitates the analysis and interpretation of complex data.
insight during proof generation. The framework employs a These findings are validated using the NASDAQ Stock dataset,
novel LoT-Transfer Learning training-inference pipeline that underscoring the value of visual inputs for bolstering numer-
harnesses long chain-of-thought processes’ emergent formal ical reasoning in LLMs.
reasoning abilities. Extensive experiments demonstrate that 10) Geography Applications: Yu et al. [203] introduce
MA-LoT achieves a 61.07% accuracy on the Lean4 ver- MineAgent, a modular framework designed to enhance the
sion of the MiniF2F-Test dataset, significantly outperforming capabilities of multimodal large language models (MLLMs)
baselines such as GPT-4 (22.95%), single-agent tree search in the domain of remote-sensing mineral exploration. This
methods (50.70%), and whole-proof generation techniques field presents significant challenges, including the need for
(55.33%). These results underscore the potential of integrating domain-specific geological knowledge and the complexity of
long chain-of-thought reasoning with formal verification to reasoning across multiple remote-sensing images, which is
enhance automated theorem proving. further complicated by long-context issues. MineAgent ad-
b) Educational and Tutoring Applications: Yue et al. dresses these challenges by incorporating hierarchical judg-
[200] introduce MATHVC, a pioneering virtual classroom ing and decision-making modules to improve multi-image
powered by large language models (LLMs) designed to en- reasoning and spatial-spectral integration. In addition, the
hance students’ mathematical modeling (MM) skills through authors propose MineBench, a specialized benchmark to eval-
collaborative group discussions. Recognizing that traditional uate MLLMs on mineral exploration tasks using geological
MM practice often suffers from uneven access to qualified and hyperspectral data. Extensive experiments demonstrate
teachers and resources, the authors leverage LLMs’ capabil- the effectiveness of MineAgent, showcasing its potential to
ities to simulate diverse student characters, each embody- significantly advance the use of MLLMs in the critical area of
ing distinct math-relevant properties. To ensure that these remote-sensing mineral exploration
simulated interactions mirror authentic student discussions, Ning et al. [204] introduce an autonomous geographic
the framework incorporates three key innovations: integrating information system (GIS) agent framework that utilizes large
domain-specific MM knowledge into the simulation, defining language models (LLMs) to perform spatial analyses and
a symbolic schema to ground character behaviors, and em- cartographic tasks. A significant research gap in the field has
ploying a meta planner to guide the conversational flow. been the ability of these agents to autonomously discover
Liu et al. [201] introduce the Personalized Conversational and retrieve the necessary geospatial data. The proposed
Tutoring Agent (PACE) for mathematics instruction, address- framework addresses this by generating, executing, and de-
ing a critical gap in intelligent educational systems by adapting bugging programs to select data sources from a predefined
33
list, using source-specific handbooks that document metadata film production, music and poetry generation, drama scripting,
and retrieval details. The framework is designed in a plug- fashion assistance, and lyric composition. Fig. 12 presents a
and-play style, allowing users or automated crawlers to easily classification of agent LLM applications for Multimedia.
add new data sources by creating additional handbooks. A a) Film Automation Agents: Xu et al. [205] introduce
prototype of the agent has been developed as a QGIS plugin FilmAgent, an innovative LLM-based multi-agent collabora-
and Python program. Experimental results demonstrate its tive framework designed to automate end-to-end film pro-
capability to retrieve data from various sources, including duction within 3D virtual spaces. Virtual film production
OpenStreetMap, U.S. Census Bureau demographic data, satel- involves complex decision-making, including scriptwriting,
lite basemaps from ESRI, global digital elevation models from cinematography, and actor positioning. FilmAgent simulates
OpenTopography, weather data, and COVID-19 case data from various crew roles such as directors, screenwriters, actors,
the NYTimes GitHub. This work is one of the first efforts to and cinematographers, covering crucial stages of the film
create an autonomous GIS agent for geospatial data retrieval, production process. These stages include idea development,
marking a significant advancement in spatial data automation. where brainstormed ideas are transformed into structured
story outlines; scriptwriting, which generates dialogues and
Melody-Lyric character actions; and cinematography, which determines the
Agents
[212] camera setups for each shot. The agents collaborate iteratively,
providing feedback and revisions to verify intermediate scripts
Multi-Agent
Poetry
and reduce hallucinations. Evaluations of the generated videos
Framework on 15 ideas across four key aspects show that FilmAgent
[211]
outperforms all baselines, achieving an average score of 3.98
Lyric
Generation out of 5. Despite using the GPT-4o model, FilmAgent sur-
Agents
passes the single-agent o1, demonstrating the benefits of a
Poetry
Generation MusicAgent coordinated multi-agent system.
[210]
Agents b) Story-to-Video Production Agents: Wang et al. [206]
Music introduce AesopAgent, an Agent-driven Evolutionary Sys-
Under- tem designed for story-to-video production, leveraging the
standing &
Generation advancements in Agent and Artificial Intelligence Generated
Agents
Content (AIGC) technologies. AesopAgent integrates multiple
Symbolic
ComposerX generative capabilities within a unified framework, enabling
[209]
Music users to easily convert story proposals into scripts, images,
Composition
Agents audio, and videos. The system orchestrates the entire video
Multimedia generation workflow, ensuring that the generated content is
Applications
both rich and coherent. The system consists of two layers:
Fashion-Domain
Conver- Fashion the Horizontal Layer and the Utility Layer. The Horizontal
sational
Agents
Assis-
tant
Layer incorporates a novel RAG-based evolutionary system
Eval. that continuously optimizes the video production process by
[208]
Drama
accumulating expert knowledge and refining workflow steps,
Script such as LLM prompt optimization. The Utility Layer provides
Generation
Agents essential tools for consistent image generation, ensuring visual
coherence in terms of composition, characters, and style, while
Story-to-Video
Production
IBSEN also integrating audio and special effects.
[207]
Agents c) Drama Script Generation Agents: Han et al. [207]
Film Au- introduce IBSEN, a director-actor coordination agent frame-
tomation
Agents work designed to generate drama scripts and provide greater
control over the plot development, especially in scenarios
AesopAgent
[206]
where human players are involved. While current language
model agents excel at creating individual behaviors for char-
acters, they often struggle with maintaining consistency and
FilmAgent coherence at the storyline level. IBSEN addresses this by
[205]
introducing a director agent that writes plot outlines based on
Fig. 12: Agent LLM Applications in Multimedia user input, instructs actor agents to role-play their respective
characters, and adjusts the plot as needed to ensure that
11) Multimedia Applications: Multimedia is an emerging the narrative progresses toward the intended objective. The
frontier for LLM-based agents, where creative and interpretive framework was evaluated using a novel drama plot involving
tasks require coordination across diverse modalities, including multiple actor agents, where the interactions were guided by
text, audio, image, and video. In this subsection, we present the director agent. The results demonstrate that IBSEN is ca-
recent advancements in applying agent-based language learn- pable of generating diverse and complete drama scripts from a
ing and machine learning (LLM) systems to domains such as rough plot outline, while preserving the unique characteristics
34
FilmAgent [205] 2025 Film Fully automate Multi-agent roles Outperforms Mean user Virtual studio Exports
Automation end-to-end 3D virtual (director, single-agent baselines score 3.98/5 pipeline MP4/WebM
film production. screenwriter, actors, with coherent video support
cinematographer) across 15 scenarios.
with iterative
feedback loops.
AesopAgent 2024 Story→Video Convert story drafts Two-layer Rich, coherent Workflow Integrates with Supports
[206] into scripts, images, RAG-evolutionary multimodal outputs convergence AIGC asset PNG, WAV,
audio, and video. workflow plus utility with continuous rate ≈ 85 % generators MP4
layer for optimization.
image/audio/effects.
IBSEN [207] 2024 Drama Generate coherent Director agent Diverse, complete Narrative Scriptwriting Plain-text
Scripts drama scripts via outlines plot; actor scripts preserving coherence ¿ toolchain script output
director–actor agents role-play and character traits. 90% (human compatible
coordination. adjust narrative. eval)
Fashion-Agent 2024 Conversational Enhance online LLM front-end 4 000-dialog dataset; Precision@5: E-commerce JSON /
[208] Retail fashion discovery connects to search & improves retrieval 78% API HTML widget
with LLM dialogue recommendation relevance by 18 %. integration
agents. backends.
ComposerX [209] 2024 Music Multi-agent symbolic Agents specialize in Coherent polyphonic Subjective MIDI pipeline Standard
Composi- music generation with melody, harmony, and pieces rated high on rating 4.2/5 plugin MIDI files
tion harmony constraints. structure using LLM musicality.
reasoning.
MusicAgent 2023 Music Orchestrate diverse Autonomous task Simplifies tool use; Task Integrates WAV, MP3,
[210] Processing music tasks via decomposition and reduces development completion FFmpeg, MIDI
unified LLM agent. tool invocation over effort by 40 %. time ↓ 40 % Librosa, Web
HF/GitHub/APIs. APIs
PoetryAgents 2024 Poetry Boost diversity & Cooperative & +3.0–3.7 pp diversity; Distinct Text pipeline UTF-8 text
[211] Generation novelty in non-cooperative agent +5.6–11.3 pp novelty. n-gram ↑ 11% integration
LLM-generated interactions on
poetry via multi-agent GPT-2/3/4.
social learning.
LyricAgents 2024 Lyric Melody-to-lyric Agents for rhyme, Listening test Alignment Singing-synth LRC / JSON
[212] Generation alignment in tonal syllable, alignment & accuracy 85 %. score 0.87 pipeline ready lyric files
languages with consistency; evaluated
multi-agent sub-tasks. via singing synth.
Eval. Metrics: Evaluation Metrics; Pipeline Integr.: Pipeline Integration; Fmt. Compat.: Format Compatibility.
of each character, showing the effectiveness of the framework LLMs have demonstrated impressive performance in STEM
in producing controlled, dynamic narrative content. domains, they often struggle with music composition, par-
d) Fashion-Domain Conversational Agents: Maroniko- ticularly when dealing with long dependencies and harmony
lakis et al. [208] focus on the potential of Large Language constraints. Even when equipped with advanced techniques
Models (LLMs) to revolutionize online fashion retail by en- like In-Context Learning and Chain-of-Thought, LLMs typi-
hancing customer experiences and improving product discov- cally generate poorly structured music. ComposerX aims to
ery through conversational agents. These LLM-powered agents address this by leveraging the reasoning abilities of LLMs
allow customers to interact naturally, refining their needs and their extensive knowledge of music history and theory. By
and receiving personalized fashion and shopping advice. For employing a multi-agent approach, the framework significantly
tasks like finding specific products, conversational agents must enhances the music composition quality of GPT-4. The results
translate customer interactions into calls to various backend show that ComposerX is capable of generating coherent,
systems, such as search engines, to display relevant product polyphonic music compositions with engaging melodies that
options. The authors emphasize the importance of evaluating follow user instructions, marking a substantial improvement in
the capabilities of LLMs in these tasks, particularly in integrat- the application of LLMs to creative music composition tasks.
ing with backend systems. However, existing evaluations are f) Music Understanding & Generation Agents: Yu et al.
often complex due to the lack of high-quality, relevant datasets [210] present MusicAgent, a system designed to streamline
that align with business needs. To address this, the authors AI-powered music processing by organizing and integrat-
developed a multilingual evaluation dataset comprising 4,000 ing diverse music-related tasks. Music processing spans a
conversations between customers and a fashion assistant on a wide range of activities, from generation tasks like timbre
large e-commerce platform. synthesis to comprehension tasks like music classification.
e) Symbolic Music Composition Agents: Deng et al. However, developers and amateurs often struggle to navigate
[209] introduce ComposerX, an agent-based symbolic music the complexity of these tasks, particularly due to the varying
generation framework designed to enhance the music compo- representations of music data and the applicability of different
sition capabilities of Large Language Models (LLMs). While models across platforms. MusicAgent addresses this challenge
35
Allow a diverse selection of MCP Enabling dynamic, multimodal interactions among Enable agents to interface with tools, APIs,
servers to be integrated with various agents without requiring shared memory, and resources using standardized structured
agents. Agent A resources, or tools. Agent B inputs and outputs.
MCP Host MCP Host
A2A Server
A2A Client
Large Language Model (e.g.,
DeepSeek, Qwen, ...etc.)
Remote CrewAI
A2A protocol
MCP MCP MCP Agent Agent MCP MCP MCP
OpenRouter API
Server protocol Client Client protocol Server
A2A Server
A2A Client
Remote A2A protocol LangChain
Agent Agent
A2A Server
Local Data
A2A Client
Local Data Remote Haystack Source 4
A2A Server
A2A Client
Server protocol Client Client protocol Server
Microsoft
Remote A2A protocol AutoGen
Agent Remote
Agent
Remote Service
Service
Front-End Front-End
Fig. 13: Multi-Agent Integration Framework: Enabling dynamic collaboration through the A2A and MCP Protocols.
the creation of AI-native applications and accelerating inno- crosoft AutoGen, which communicate via the A2A protocol.
vation across diverse domains. This communication method allows agents to collaborate
3) Agent-to-Agent Protocol (A2A): In 2025, Google intro- dynamically without sharing internal memories, resources, or
duced the Agent2Agent (A2A) protocol to usher in a new tools, ensuring secure and efficient inter-agent exchanges. In
era of seamless interoperability among AI agents, significantly parallel, the framework utilizes the MCP protocol to stan-
enhancing workplace productivity and automation [215]. The dardize interactions with various tools, APIs, and resources,
protocol is designed to facilitate dynamic collaboration be- enabling agents to connect with both local data sources and
tween autonomous agents, enabling them to work together remote services through structured inputs and outputs.
across isolated data systems and diverse applications regard- Tab. XII provides a comparative analysis of three agent
less of their underlying frameworks or vendors. Using familiar communication protocols: MCP, ACP, and A2A. It highlights
standards such as HTTP, SSE, and JSON-RPC, A2A simplifies their primary purpose, typical setup, core features, and ideal
integration with existing IT infrastructures while also ensuring use cases. MCP (Model Context Protocol) focuses on in-
robust enterprise-grade security through proven authentication tegrating data and tools into LLM workflows, providing a
and authorization practices. A2A supports both swift and standardized interface for delivering context. ACP (Agent
long-duration tasks by allowing agents to exchange real-time Communication Protocol), a component of the BeeAI plat-
updates, negotiate user interface requirements, and perform form, enables communication among multiple agents in a
capability discovery via structured ”Agent Cards. local-first setup, providing tools for agent discovery and
MCP is designed to connect agents with tools, APIs, and telemetry. In contrast, A2A (Agent-to-Agent Protocol) enables
resources through structured inputs and outputs. It is fully interoperability between agents across different frameworks,
supported by Google’s ADK, which enables a wide range of allowing them to exchange tasks and collaborate. The table
MCP servers to be seamlessly integrated with AI agents. In highlights the distinct roles these protocols play in agent-
parallel, A2A 3 provides a dynamic, multimodal framework based systems, with MCP focusing on data integration for
for agent-to-agent communication, allowing different agents to LLMs, ACP concentrating on local agent orchestration, and
collaborate without sharing memory, resources, or tools. Fig. A2A facilitating cross-platform collaboration among agents.
13 presents a sophisticated multi-agent integration framework
that leverages two key protocols A2A and MCP to enable D. Training datasets
seamless interactions among diverse agents and services. It High-quality training datasets are crucial for enhancing
depicts multiple remote agents, including those branded as the reasoning, multilingual understanding, and instruction-
CrewAI Agent, LangChain Agent, Haystack Agent, and Mi- following abilities of large language models. In this subsection,
we present three recently developed datasets: NaturalReason-
3 https://google.github.io/A2A/ ing, FineWeb2, and MagPie-Ultra. Each dataset addresses
37
different aspects of model training, ranging from expanding configurations, FineWeb2 employs innovative techniques such
reasoning across multiple domains to enhancing multilingual as ”re-hydration” deduplication and language-specific filtering
capabilities and advancing the generation of synthetic instruc- to ensure high data quality. Extensive ablation experiments,
tions. conducted with a 1.45 billion-parameter model trained on
1) NaturalReasoning dataset: Scaling reasoning capabili- 30 billion tokens, further validate the dataset’s robustness.
ties beyond traditional domains such as math and coding has In comparative evaluations against established datasets like
been challenging due to the scarcity of diverse, high-quality CC-100, mC4, CulturaX, and HPLT, FineWeb2 consistently
questions. In response, [217] introduces NaturalReasoning a outperforms across diverse languages. Additionally, special-
comprehensive dataset comprising 2.8 million questions that ized evaluations using the FineTasks benchmark on 9 varied
span multiple domains, including STEM fields (like Physics languages underscore its potential for advancing multilingual
and Computer Science), Economics, and Social Sciences, com- natural language processing and retrieval-augmented genera-
plete with reference answers. The dataset is designed not only tion applications.
to serve as a resource for knowledge distillation experiments, 3) MagPie-Ultra dataset: MagPie-Ultra [219] is a synthetic
where it effectively transfers reasoning capabilities from a dataset generated using Meta Llama 3.1 405 B-Instruct FP8,
strong teacher model, but also for unsupervised self-training representing the first open dataset of its kind. It comprises
using external reward models. When training the Llama3.1- 50,000 synthetic instruction pairs, created by prompting the
8B-Instruct model, NaturalReasoning demonstrates superior language model with minimal ”empty” prompts (only initial
scaling effects, achieving notably higher average performance special tokens) that allow it to generate both user queries and
on benchmarks such as MATH, GPQA, and MMLU-Pro corresponding responses auto-regressively. These generated
compared to other datasets. This work highlights the potential pairs, filtered according to the MagPie recipe and refined via
of a large, diverse question dataset to expand the boundaries Argilla distilabel, cover a diverse range of challenging tasks,
of LLM reasoning across a broader range of fields. including coding, mathematics, data analysis, creative writing,
2) FineWeb2 dataset: Hugging Face’s team introduced advice seeking, and brainstorming. In addition to the raw
[218] FineWeb2, a groundbreaking multilingual dataset com- instruction pairs, the dataset includes detailed metadata quality
prising 8TB of meticulously cleaned text data with over and difficulty scores, embeddings, topic labels, and safety as-
3 trillion non-English words drawn from more than 1,000 sessments from tools like ArmorRM and LlamaGuard, which
languages. FineWeb2 supports a total of 1,893 languages, with further support its use in training and evaluating large language
substantial coverage 486 languages include more than 1MB models across complex instruction-following scenarios.
of data and 80 languages boast over 1GB each demonstrating
its extensive linguistic diversity. Built upon 96 snapshots of V. C HALLENGES AND O PEN PROBLEMS
CommonCrawl data spanning 2013 to 2024 and processed As the capabilities of AI agents and large language models
using the ”datatrove” alongside sophisticated filtering code and continue to grow, new challenges and open problems emerge
38
that limit their effectiveness, reliability, and security [220]. In questions, background surveys, inspirations, and hypotheses,
this section, we highlight several critical research directions, across 12 disciplines. Expert validation ensures the reliability
including advancing the reasoning abilities of AI agents, of this framework. By exclusively using papers published in
understanding the failure modes of multi-agent systems, sup- 2024, the study minimizes data contamination from large lan-
porting automated scientific discovery, enabling dynamic tool guage model (LLM) pretraining datasets, revealing that LLMs
integration, reinforcing autonomous search capabilities, and perform notably well in retrieving novel inspirations. This
addressing the vulnerabilities of emerging communication positions LLMs as promising “research hypothesis mines”
protocols. that can facilitate the automation of scientific discovery by
generating innovative hypotheses at scale.
A. AI Agents Reasoning Despite these advances, significant challenges remain for AI
The primary challenge addressed in [221] is the inherent agents employing LLMs to automate scientific discovery. One
limitation of traditional Chain-of-Thought (CoT) methods, key obstacle is ensuring that these agents generate novel and
which only reveal the final reasoning steps without explicitly scientifically valid hypotheses, as they must navigate the risk
modeling the underlying cognitive process that leads to those of producing biased or spurious associations without sufficient
steps. Meta Chain-of-Thought (Meta-CoT) aims to fill this human oversight. Furthermore, the complexity and diversity
gap by capturing and formalizing the latent reasoning that of scientific literature across various disciplines demand that
underlies a Chain-of-Thought (CoT). This involves not only these agents not only understand domain-specific nuances but
generating the visible chain of thought but also understanding also adapt dynamically to evolving research contexts. The risk
the in-context search behavior and iterative reasoning steps of data contamination, particularly when recent publications
that contribute to it. To overcome these challenges, the authors might overlap with pretraining data, further complicates the
explore innovative approaches, including process supervision, extraction of truly innovative insights. In addition, scaling
synthetic data generation, and search algorithms, to produce these systems while preserving transparency, interpretability,
robust Meta-CoTs. Moreover, they propose a concrete train- and ethical standards poses a multifaceted challenge that must
ing pipeline that integrates instruction tuning with linearized be addressed to harness the potential of AI-driven scientific
search traces and reinforcement learning post-training. Open discovery fully.
research questions remain regarding scaling laws, the role of
verifiers, and the discovery of novel reasoning algorithms,
underscoring the complexity and potential of advancing more
human-like reasoning in large language models. D. Dynamic Tool Integration for Autonomous AI Agents
B. Why Do Multi-Agent LLM Systems Fail? Wu et al. [224] introduce Chain-of-Tools, a novel tool
Pan et al. [222] present a critical examination of why learning approach that leverages the robust semantic represen-
multi-agent LLM systems, despite the theoretical benefits of tation capabilities of frozen large language models (LLMs) to
collaboration, continue to underperform compared to their perform tool calling as part of a chain-of-thought reasoning
single-agent counterparts. Through a rigorous study of five process. By utilizing a vast and flexible tool pool that can
open-source frameworks across 150 tasks, the authors enlist include previously unseen tools, this method addresses the
expert human annotators to identify fourteen distinct failure inefficiencies and highlights key challenges, including man-
modes ranging from ignoring task or role specifications and aging vast prompt-based demonstrations. The authors validate
unnecessary repetition, to lapses in memory and flawed verifi- their approach on a range of datasets, including a newly
cation processes. These issues are systematically grouped into constructed dataset, SimpleToolQuestions, as well as GSM8K-
three categories: design and specification shortcomings, inter- XL, FuncQA, and KAMEL, demonstrating that Chain-of-
agent misalignment, and challenges in task verification and Tools outperforms conventional baselines. Additionally, the
termination. Moreover, the study explores interventions such method holds promise for enhancing autonomous AI agents by
as refining agent role definitions and orchestration strategies, enabling them to select and utilize external tools dynamically,
but finds that these measures alone are insufficient; thereby, thereby broadening their capability to solve complex, multi-
it outlines a clear roadmap for future research to address the step tasks independently. This work prompts several questions:
intricate challenges inherent in multi-agent coordination. How can the integration of unseen tools further enhance LLM
adaptability in diverse scenarios? What critical dimensions
of the model output influence effective tool selection, and
C. AI Agents in Automated Scientific Discovery how can they be optimized for greater interpretability? More-
Liu et al. [223] introduce a large-scale benchmark for over, how might this methodology be extended to enable
evaluating the capability of large language models (LLMs) in more robust autonomous decision-making in AI agents facing
generating high-quality scientific research hypotheses. It tack- increasingly complex reasoning challenges? Notably, these
les this gap by focusing on three pivotal sub-tasks: inspiration questions also underscore key challenges such as managing
retrieval, hypothesis composition, and hypothesis ranking. The a huge tool pool, ensuring efficient tool selection, enhancing
authors have developed an automated framework that extracts model interpretability, and integrating autonomous AI agents
key components from scientific papers, including research capable of dynamic, independent operation.
39
E. Empowering LLM Agents with Integrated Search via Rein- Researchers have developed various training and infer-
forcement Learning ence strategies to cultivate these reasoning abilities, including
ReSearch [225] represents a significant step toward endow- inference-time scaling, pure reinforcement learning (for ex-
ing LLM-based agents with the ability to decide autonomously ample, DeepSeek-R1-Zero), supervised fine-tuning combined
when and how to consult external knowledge sources, seam- with reinforcement learning, and distillation-based fine-tuning.
lessly weaving search operations into their reasoning chains Adaptations of Qwen-32B and Llama-based architectures
via reinforcement learning. By framing search as an action- show that a balanced combination of these methods yields
able tokenized operation rather than a separate retrieval step emergent reasoning behaviors while reducing overthinking and
ReSearch trains models like Qwen2.5 through a reward signal verbosity.
that emphasizes final-answer accuracy and adherence to a We also provided a unified comparison of state-of-the-art
structured think/search/result format. This paradigm eliminates benchmarks from 2019 to 2025, together with a taxonomy of
the need for painstakingly annotated reasoning traces and approximately 60 evaluation suites. Our analysis encompasses
yields strong multi-hop question–answering performance and training frameworks, including mixture-of-experts, retrieval-
cross-domain generalization. Yet, several challenges remain augmented generation, and reinforcement learning, as well as
for deploying such agents in the wild: how to scale the ap- architectural enhancements that drive performance improve-
proach to richer, real-time toolsets (e.g., calculators, databases, ments. In addition, we reviewed AI agent frameworks devel-
code execution environments) without blowing up action oped between 2023 and 2025 and illustrated their applications
spaces; how to design more nuanced reward functions that in domains including materials science, biomedical research,
capture partial credit for intermediate reasoning or mitigate synthetic data generation, and financial forecasting.
reward hacking; how to ensure robustness and interpretability Despite these successes, several challenges remain. Key
when agents autonomously interleave reasoning and tool use; open problems include automating multi-step reasoning with-
and how to balance exploration of novel tool sequences against out human oversight, balancing structured guidance with
exploitation of known effective patterns. Addressing these model flexibility, and integrating long-context retrieval at
questions will be crucial for realizing truly versatile, trust- scale. Future research must address these challenges to unlock
worthy LLM agents capable of complex, multi-step problem- the full potential of autonomous AI agents.
solving. Looking ahead, we anticipate an increasing focus on
domain- and application-specific optimization. Early exam-
ples, such as DeepSeek-R1-Distill, Sky-T1, and TinyZero,
F. Vulnerabilities of AI Agents Protocols demonstrate how specialized reasoning systems can achieve
MCP protocol standardizes how AI applications provide a favorable trade-off between performance and computational
context to LLMs. The MCP protocol faces critical vulnera- cost. Continued innovation in training methodologies, model
bilities in Agent AI communications due to its fundamentally architectures, and benchmarking will drive the next generation
decentralized design [216]. Without a central authority over- of high-efficiency, high-accuracy AI reasoning systems.
seeing security, disparate implementation practices can lead to
uneven defenses, making it easier for attackers to exploit weak
links. In particular, the absence of a standardized authentica- R EFERENCES
tion mechanism across different nodes hinders reliable identity
verification, thereby increasing the risk of unauthorized access [1] A. Jaech, A. Kalai, A. Lerer, A. Richardson, A. El-Kishky, A. Low,
A. Helyar, A. Madry, A. Beutel, A. Carney et al., “Openai o1 system
and potential data breaches. Moreover, deficiencies in robust card,” arXiv preprint arXiv:2412.16720, 2024.
logging and debugging tools further complicate the timely [2] J. Xu, Z. Guo, J. He, H. Hu, T. He, S. Bai, K. Chen,
detection of anomalies and errors, which is vital for preventing J. Wang, Y. Fan, K. Dang, B. Zhang, X. Wang, Y. Chu, and
and mitigating attacks. Additionally, the complexity inherent in J. Lin, “Qwen2.5-omni technical report,” 2025. [Online]. Available:
https://arxiv.org/abs/2503.20215
managing multi-step, distributed workflows can lead to state [3] D. Guo, D. Yang, H. Zhang, J. Song, R. Zhang, R. Xu, Q. Zhu, S. Ma,
inconsistencies and operational glitches, amplifying the po- P. Wang, X. Bi et al., “Deepseek-r1: Incentivizing reasoning capability
tential impact of a security compromise across interconnected in llms via reinforcement learning,” arXiv preprint arXiv:2501.12948,
2025.
systems. [4] A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle,
A. Letman, A. Mathur, A. Schelten, A. Vaughan et al., “The llama 3
herd of models,” arXiv preprint arXiv:2407.21783, 2024.
VI. C ONCLUSION [5] X. Huang, W. Liu, X. Chen, X. Wang, H. Wang, D. Lian, Y. Wang,
R. Tang, and E. Chen, “Understanding the planning of llm agents: A
In this paper, we have surveyed recent advances in the survey,” arXiv preprint arXiv:2402.02716, 2024.
reasoning capabilities of large language models (LLMs) and [6] J. Gu, X. Jiang, Z. Shi, H. Tan, X. Zhai, C. Xu, W. Li, Y. Shen,
autonomous AI agents and highlighted the benefits of multi- S. Ma, H. Liu et al., “A survey on llm-as-a-judge,” arXiv preprint
arXiv:2411.15594, 2024.
step, intermediate processing for solving complex tasks in
[7] Q. Wang, R. Ding, Z. Chen, W. Wu, S. Wang, P. Xie, and F. Zhao,
advanced mathematics, code generation, and logical reasoning. “Vidorag: Visual document retrieval-augmented generation via dynamic
By exposing their internal reasoning through intermediate iterative reasoning agents,” arXiv preprint arXiv:2502.18017, 2025.
steps, models such as DeepSeek-R1, OpenAI o1 and o3, and [8] Y. Li, Y. Li, X. Wang, Y. Jiang, Z. Zhang, X. Zheng, H. Wang, H.-T.
Zheng, P. Xie, P. S. Yu et al., “Benchmarking multimodal retrieval
GPT-4o achieve greater accuracy and reliability compared to augmented generation with dynamic vqa dataset and self-adaptive
direct-response approaches. planning agent,” arXiv preprint arXiv:2411.02937, 2024.
40
[9] H. Q. Yu and F. McQuade, “Rag-kg-il: A multi-agent hybrid framework [32] A. Zhou, K. Yan, M. Shlapentokh-Rothman, H. Wang, and Y.-X. Wang,
for reducing hallucinations and enhancing llm reasoning through rag “Language agent tree search unifies reasoning acting and planning in
and incremental knowledge graph learning integration,” arXiv preprint language models,” arXiv preprint arXiv:2310.04406, 2023.
arXiv:2503.13514, 2025. [33] H. Su and Others, “Learn-by-interact: A data-centric framework
[10] S. Ateia and U. Kruschwitz, “Bioragent: A retrieval-augmented gener- for self-adaptive agents in realistic environments,” arXiv preprint
ation system for showcasing generative query expansion and domain- arXiv:2501.10893, 2025.
specific search for scientific q&a,” arXiv preprint arXiv:2412.12358, [34] M. Hu, P. Zhao, C. Xu, Q. Sun, J. Lou, Q. Lin, P. Luo, and S. Ra-
2024. jmohan, “Agentgen: Enhancing planning abilities for large language
[11] H. Shimadzu, T. Utsuro, and D. Kitayama, “Retrieval-augmented model based agent via environment and task generation,” arXiv preprint
simulacra: Generative agents for up-to-date and knowledge-adaptive arXiv:2408.00764, 2024.
simulations,” arXiv preprint arXiv:2503.14620, 2025. [35] A. Zeng, M. Liu, R. Lu, B. Wang, X. Liu, Y. Dong, and J. Tang,
[12] G. Xiong, Q. Jin, X. Wang, Y. Fang, H. Liu, Y. Yang, F. Chen, Z. Song, “Agenttuning: Enabling generalized agent abilities for llms,” arXiv
D. Wang, M. Zhang et al., “Rag-gym: Optimizing reasoning and search preprint arXiv:2310.12823, 2023.
agents with process supervision,” arXiv preprint arXiv:2502.13957, [36] C. Gulcehre, T. L. Paine, S. Srinivasan, K. Konyushkova, L. Weerts,
2025. A. Sharma, A. Siddhant, A. Ahern, M. Wang, C. Gu et al., “Re-
[13] M. A. Ferrag, N. Tihanyi, and M. Debbah, “Reasoning beyond limits: inforced self-training (rest) for language modeling,” arXiv preprint
Advances and open problems for llms,” 2025. [Online]. Available: arXiv:2308.08998, 2023.
https://arxiv.org/abs/2503.22732
[37] R. Aksitov, S. Miryoosefi, Z. Li, D. Li, S. Babayan, K. Kopparapu,
[14] J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman,
Z. Fisher, R. Guo, S. Prakash, P. Srinivasan et al., “Rest meets react:
D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat et al., “Gpt-4
Self-improvement for multi-step reasoning llm agent,” arXiv preprint
technical report,” arXiv preprint arXiv:2303.08774, 2023.
arXiv:2312.10003, 2023.
[15] G. Team, R. Anil, S. Borgeaud, J.-B. Alayrac, J. Yu, R. Soricut,
J. Schalkwyk, A. M. Dai, A. Hauth, K. Millican et al., “Gem- [38] T. Guo, X. Chen, Y. Wang, R. Chang, S. Pei, N. V. Chawla, O. Wiest,
ini: a family of highly capable multimodal models,” arXiv preprint and X. Zhang, “Large language model based multi-agents: A survey
arXiv:2312.11805, 2023. of progress and challenges,” arXiv preprint arXiv:2402.01680, 2024.
[16] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, [39] A. Goldie, A. Mirhoseini, H. Zhou, I. Cai, and C. D. Manning,
N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale et al., “Llama “Synthetic data generation & multi-step rl for reasoning & tool use,”
2: Open foundation and fine-tuned chat models,” arXiv preprint arXiv preprint arXiv:2504.04736, 2025.
arXiv:2307.09288, 2023. [40] S. Hong, X. Zheng, J. Chen, Y. Cheng, J. Wang, C. Zhang,
[17] S. Schmidgall, Y. Su, Z. Wang, X. Sun, J. Wu, X. Yu, J. Liu, Z. Liu, Z. Wang, S. K. S. Yau, Z. Lin, L. Zhou et al., “Metagpt: Meta
and E. Barsoum, “Agent laboratory: Using llm agents as research programming for multi-agent collaborative framework,” arXiv preprint
assistants,” arXiv preprint arXiv:2501.04227, 2025. arXiv:2308.00352, vol. 3, no. 4, p. 6, 2023.
[18] A. Ajith, M. Xia, A. Chevalier, T. Goyal, D. Chen, and T. Gao, [41] C. Qian, X. Cong, C. Yang, W. Chen, Y. Su, J. Xu, Z. Liu, and
“Litsearch: A retrieval benchmark for scientific literature search,” arXiv M. Sun, “Communicative agents for software development,” arXiv
preprint arXiv:2407.18940, 2024. preprint arXiv:2307.07924, vol. 6, no. 3, 2023.
[19] H. Kang and C. Xiong, “Researcharena: Benchmarking llms’ ability [42] Z. Mandi, S. Jain, and S. Song, “Roco: Dialectic multi-robot col-
to collect and organize information as research agents,” arXiv preprint laboration with large language models,” in 2024 IEEE International
arXiv:2406.10291, 2024. Conference on Robotics and Automation (ICRA). IEEE, 2024, pp.
[20] J. Baek, S. K. Jauhar, S. Cucerzan, and S. J. Hwang, “Researchagent: 286–299.
Iterative research idea generation over scientific literature with large [43] H. Zhang, W. Du, J. Shan, Q. Zhou, Y. Du, J. B. Tenenbaum, T. Shu,
language models,” arXiv preprint arXiv:2404.07738, 2024. and C. Gan, “Building cooperative embodied agents modularly with
[21] M. Gridach, J. Nanavati, K. Z. E. Abidine, L. Mendes, and C. Mack, large language models,” arXiv preprint arXiv:2307.02485, 2023.
“Agentic ai for scientific discovery: A survey of progress, challenges, [44] J. S. Park, J. O’Brien, C. J. Cai, M. R. Morris, P. Liang, and
and future directions,” arXiv preprint arXiv:2503.08979, 2025. M. S. Bernstein, “Generative agents: Interactive simulacra of human
[22] Y. Kim, C. Park, H. Jeong, Y. S. Chan, X. Xu, D. McDuff, H. Lee, behavior,” in Proceedings of the 36th annual acm symposium on user
M. Ghassemi, C. Breazeal, H. Park et al., “Mdagents: An adaptive interface software and technology, 2023, pp. 1–22.
collaboration of llms for medical decision-making,” Advances in Neural [45] B. Xiao, Z. Yin, and Z. Shan, “Simulating public administration crisis:
Information Processing Systems, vol. 37, pp. 79410–79452, 2024. barriers in social science research,” arXiv preprint arXiv:2311.06957,
[23] S. Mukherjee, P. Gamble, M. S. Ausin, N. Kant, K. Aggarwal, barriers in social science research,” arXiv preprint arXiv:2311.06957,
N. Manjunath, D. Datta, Z. Liu, J. Ding, S. Busacca et al., “Polaris: 2023.
A safety-focused llm constellation architecture for healthcare,” arXiv [46] S. Wang, C. Liu, Z. Zheng, S. Qi, S. Chen, Q. Yang, A. Zhao,
preprint arXiv:2403.13313, 2024. C. Wang, S. Song, and G. Huang, “Avalon’s game of thoughts: Battle
[24] T. Yuan, Z. He, L. Dong, Y. Wang, R. Zhao, T. Xia, L. Xu, B. Zhou, against deception through recursive contemplation,” arXiv preprint
F. Li, Z. Zhang et al., “R-judge: Benchmarking safety risk awareness arXiv:2310.01320, 2023.
for llm agents,” arXiv preprint arXiv:2401.10019, 2024. [47] Y. Wang, W. Zhong, Y. Huang, E. Shi, M. Yang, J. Chen, H. Li, Y. Ma,
[25] W. Yan, J. Hu, H. Zeng, M. Liu, and W. Liang, “The application
of large language models in primary healthcare services and the landscape, and vision,” arXiv preprint arXiv:2409.09030, 2024.
challenges,” Chinese General Practice, vol. 28, no. 01, p. 1, 2025.
[48] H. Jin, L. Huang, H. Cai, J. Yan, B. Li, and H. Chen, “From llms to llm-
[26] H. Yu, J. Zhou, L. Li, S. Chen, J. Gallifant, A. Shi, X. Li, W. Hua,
based agents for software engineering: A survey of current, challenges
M. Jin, G. Chen et al., “Aipatient: Simulating patients with ehrs and llm
and future,” arXiv preprint arXiv:2408.02479, 2024.
powered agentic workflow,” arXiv preprint arXiv:2409.18924, 2024.
[27] S. Schmidgall, R. Ziaei, C. Harris, E. Reis, J. Jopling, and M. Moor, [49] A. Singh, A. Ehtesham, S. Kumar, and T. T. Khoei, “Agentic retrieval-
“Agentclinic: a multimodal agent benchmark to evaluate ai in simulated augmented generation: A survey on agentic rag,” arXiv preprint
clinical environments,” arXiv preprint arXiv:2405.07960, 2024. arXiv:2501.09136, 2025.
[28] W. Wang, Z. Ma, Z. Wang, C. Wu, W. Chen, X. Li, and Y. Yuan, “A [50] A. Yehudai, L. Eden, A. Li, G. Uziel, Y. Zhao, R. Bar-Haim, A. Cohan,
survey of llm-based agents in medicine: How far are we from baymax?” and M. Shmueli-Scheuer, “Survey on evaluation of llm-based agents,”
arXiv preprint arXiv:2502.11211, 2025. 2025. [Online]. Available: https://arxiv.org/abs/2503.16416
[29] X. Wang, Y. Chen, L. Yuan, Y. Zhang, Y. Li, H. Peng, and H. Ji, “Exe- [51] Q. Chen, L. Qin, J. Liu, D. Peng, J. Guan, P. Wang, M. Hu, Y. Zhou,
cutable code actions elicit better llm agents,” in Forty-first International T. Gao, and W. Che, “Towards reasoning era: A survey of long
Conference on Machine Learning, 2024. chain-of-thought for reasoning large language models,” arXiv preprint
[30] N. Shinn, F. Cassano, A. Gopinath, K. Narasimhan, and S. Yao, arXiv:2503.09567, 2025.
“Reflexion: Language agents with verbal reinforcement learning,” [52] B. Yan, X. Zhang, L. Zhang, L. Zhang, Z. Zhou, D. Miao, and C. Li,
Advances in Neural Information Processing Systems, vol. 36, pp. 8634– “Beyond self-talk: A communication-centric survey of llm-based multi-
8652, 2023. agent systems,” arXiv preprint arXiv:2502.14321, 2025.
[31] S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y. Cao, [53] X. Feng, L. Dou, E. Li, Q. Wang, H. Wang, Y. Guo, C. Ma, and
“React: Synergizing reasoning and acting in language models,” in L. Kong, “A survey on large language model-based social agents in
International Conference on Learning Representations (ICLR), 2023. game-theoretic scenarios,” arXiv preprint arXiv:2412.03920, 2024.
[54] C. Zhang, S. He, J. Qian, B. Li, L. Li, S. Qin, Y. Kang, M. Ma, G. Liu, [76] M. Kazemi, B. Fatemi, H. Bansal, J. Palowitch, C. Anastasiou, S. V.
Q. Lin et al., “Large language model-brained gui agents: A survey,” Mehta, L. K. Jain, V. Aglietti, D. Jindal, P. Chen et al., “Big-bench
arXiv preprint arXiv:2411.18279, 2024. extra hard,” arXiv preprint arXiv:2502.19187, 2025.
[55] Y. Li, H. Wen, W. Wang, X. Li, Y. Yuan, G. Liu, J. Liu, W. Xu, [77] K. Zhu, H. Du, Z. Hong, X. Yang, S. Guo, Z. Wang, Z. Wang, C. Qian,
X. Wang, Y. Sun et al., “Personal llm agents: Insights and sur- X. Tang, H. Ji et al., “Multiagentbench: Evaluating the collaboration
vey about the capability, efficiency and security,” arXiv preprint and competition of llm agents,” arXiv preprint arXiv:2503.01935, 2025.
arXiv:2401.05459, 2024. [78] G. Mialon, C. Fourrier, T. Wolf, Y. LeCun, and T. Scialom, “Gaia:
[56] M. C. Ramos, C. J. Collison, and A. D. White, “A review of large a benchmark for general ai assistants,” in The Twelfth International
language models and autonomous agents in chemistry,” Chemical Conference on Learning Representations, 2023.
Science, 2025. [79] R. A. Dubniczky, K. Z. Horvát, T. Bisztray, M. A. Ferrag, L. C.
[57] C. J. Wang, D. Lee, C. Menghini, J. Mols, J. Doughty, A. Khoja, Cordeiro, and N. Tihanyi, “Castle: Benchmarking dataset for static
J. Lynch, S. Hendryx, S. Yue, and D. Hendrycks, “Enigmaeval: A code analyzers and llms towards cwe detection,” arXiv preprint
benchmark of long multimodal reasoning challenges,” arXiv preprint arXiv:2503.09433, 2025.
arXiv:2502.08859, 2025. [80] J. Yao, K. Wang, R. Hsieh, H. Zhou, T. Zou, Z. Cheng, Z. Wang, and
[58] D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and P. Viswanath, “Spin-bench: How well do llms plan strategically and
J. Steinhardt, “Measuring massive multitask language understanding,” reason socially?” arXiv preprint arXiv:2503.12349, 2025.
arXiv preprint arXiv:2009.03300, 2020. [81] S. Yao, N. Shinn, P. Razavi, and K. Narasimhan, “τ-bench: A benchmark
[59] L. Zhong, Z. Du, X. Zhang, H. Hu, and J. Tang, “Complexfuncbench: for tool-agent-user interaction in real-world domains,” arXiv preprint
Exploring multi-step and constrained function calling under long- arXiv:2406.12045, 2024.
context scenario,” arXiv preprint arXiv:2501.10132, 2025. [82] D. Dua, Y. Wang, P. Dasigi, G. Stanovsky, S. Singh, and M. Gardner,
[60] L. Phan, A. Gatti, Z. Han, N. Li, J. Hu, H. Zhang, S. Shi, M. Choi, “Drop: A reading comprehension benchmark requiring discrete reason-
A. Agrawal, A. Chopra et al., “Humanity’s last exam,” arXiv preprint ing over paragraphs,” arXiv preprint arXiv:1903.00161, 2019.
arXiv:2501.14249, 2025. [83] D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang,
[61] DeepMind, “FACTS grounding: A new benchmark for evaluating D. Song, and J. Steinhardt, “Measuring mathematical problem solving
the factuality of large language models,” 2023, accessed: 2025- with the math dataset,” arXiv preprint arXiv:2103.03874, 2021.
02-03. [Online]. Available: https://deepmind.google/discover/blog/ [84] M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. D. O. Pinto, J. Kaplan,
facts-grounding-a-new-benchmark-for-evaluating-the-factuality-of- H. Edwards, Y. Burda, N. Joseph, G. Brockman et al., “Evaluating large
large-language-models/ language models trained on code,” arXiv preprint arXiv:2107.03374,
[62] C. Zheng, Z. Zhang, B. Zhang, R. Lin, K. Lu, B. Yu, D. Liu, J. Zhou, 2021.
and J. Lin, “Processbench: Identifying process errors in mathematical
[85] F. Shi, M. Suzgun, M. Freitag, X. Wang, S. Srivats, S. Vosoughi, H. W.
reasoning,” arXiv preprint arXiv:2412.06559, 2024.
Chung, Y. Tay, S. Ruder, D. Zhou et al., “Language models are multi-
[63] L. Ouyang, Y. Qu, H. Zhou, J. Zhu, R. Zhang, Q. Lin, B. Wang, lingual chain-of-thought reasoners,” arXiv preprint arXiv:2210.03057,
Z. Zhao, M. Jiang, X. Zhao et al., “Omnidocbench: Benchmarking 2022.
diverse pdf document parsing with comprehensive annotations,” arXiv
[86] V. Samuel, H. P. Zou, Y. Zhou, S. Chaudhari, A. Kalyan, T. Rajpurohit,
preprint arXiv:2412.07626, 2024.
A. Deshpande, K. Narasimhan, and V. Murahari, “Personagym: Evalu-
[64] M. Zhuge, C. Zhao, D. Ashley, W. Wang, D. Khizbullin, Y. Xiong,
ating persona agents and llms,” arXiv preprint arXiv:2407.18416, 2024.
Z. Liu, E. Chang, R. Krishnamoorthi, Y. Tian et al., “Agent-as-a-judge:
Evaluate agents with agents,” arXiv preprint arXiv:2410.10934, 2024. [87] C. Ye, Z. Hu, Y. Deng, Z. Huang, M. D. Ma, Y. Zhu, and W. Wang,
“Mirai: Evaluating llm agents for event forecasting,” arXiv preprint
[65] S. Tan, S. Zhuang, K. Montgomery, W. Y. Tang, A. Cuadron, C. Wang,
arXiv:2407.01231, 2024.
R. A. Popa, and I. Stoica, “Judgebench: A benchmark for evaluating
llm-based judges,” arXiv preprint arXiv:2410.12784, 2024. [88] H. Trivedi, T. Khot, M. Hartmann, R. Manku, V. Dong, E. Li, S. Gupta,
[66] OpenAI, “Introducing simpleqa,” 2024, accessed: 2025-02-03. A. Sabharwal, and N. Balasubramanian, “Appworld: A controllable
[Online]. Available: https://openai.com/index/introducing-simpleqa/ world of apps and people for benchmarking interactive coding agents,”
arXiv preprint arXiv:2407.18901, 2024.
[67] HuggingFaceFW, “Fine tasks,” 2024, accessed: 2025-02-03.
[Online]. Available: https://huggingface.co/spaces/HuggingFaceFW/ [89] X. Liu, T. Zhang, Y. Gu, I. L. Iong, Y. Xu, X. Song, S. Zhang, H. Lai,
blogpost-fine-tasks X. Liu, H. Zhao et al., “Visualagentbench: Towards large multimodal
[68] S. Krishna, K. Krishna, A. Mohananey, S. Schwarcz, A. Stam- models as visual foundation agents,” arXiv preprint arXiv:2408.06327,
bler, S. Upadhyay, and M. Faruqui, “Fact, fetch, and reason: A 2024.
unified evaluation of retrieval-augmented generation,” arXiv preprint [90] Z. Chen, S. Chen, Y. Ning, Q. Zhang, B. Wang, B. Yu, Y. Li, Z. Liao,
arXiv:2409.12941, 2024. C. Wei, Z. Lu et al., “Scienceagentbench: Toward rigorous assessment
[69] Hugging Face, “Dabstep,” 2025, accessed: 2025-02-03. [Online]. of language agents for data-driven scientific discovery,” arXiv preprint
Available: https://huggingface.co/blog/dabstep arXiv:2410.05080, 2024.
[70] H. Mao, C. C.-J. Ji, F. Yan, T. Zhang, and S. G. Patil, “Bfcl v2 [91] Z. Zhang, S. Cui, Y. Lu, J. Zhou, J. Yang, H. Wang, and M. Huang,
live,” https://gorilla.cs.berkeley.edu/blogs/12_bfcl_v2_live.html, 2024, “Agent-safetybench: Evaluating the safety of llm agents,” arXiv
accessed: February 16, 2025. preprint arXiv:2412.14470, 2024.
[71] S. Miserendino, M. Wang, T. Patwardhan, and J. Heidecke, [92] B. P. Majumder, H. Surana, D. Agarwal, B. D. Mishra, A. Meena,
“Swe-lancer: Can frontier llms earn $1 million from real world A. Prakhar, T. Vora, T. Khot, A. Sabharwal, and P. Clark, “Discov-
freelance software engineering?” 2025. [Online]. Available: https: erybench: Towards data-driven discovery with large language models,”
//arxiv.org/abs/2502.12115 arXiv preprint arXiv:2407.01725, 2024.
[72] X. Yang, K. Sun, H. Xin, Y. Sun, N. Bhalla, X. Chen, S. Choudhary, [93] K. Gu, R. Shang, R. Jiang, K. Kuang, R.-J. Lin, D. Lyu, Y. Mao, Y. Pan,
R. D. Gui, Z. W. Jiang, Z. Jiang et al., “Crag–comprehensive rag T. Wu, J. Yu et al., “Blade: Benchmarking language model agents for
benchmark,” arXiv preprint arXiv:2406.04744, 2024. data-driven science,” arXiv preprint arXiv:2408.09667, 2024.
[73] M. Kouremetis, M. Dotter, A. Byrne, D. Martin, E. Michalak, [94] J. Liu, W. Wang, Z. Ma, G. Huang, Y. SU, K.-J. Chang, W. Chen, H. Li,
G. Russo, M. Threet, and G. Zarrella, “Occult: Evaluating large L. Shen, and M. Lyu, “Medchain: Bridging the gap between llm agents
language models for offensive cyber operation capabilities,” 2025. and clinical practice through interactive sequential benchmarking,”
[Online]. Available: https://arxiv.org/abs/2502.15797 arXiv preprint arXiv:2412.01605, 2024.
[74] N. Tihanyi, T. Bisztray, R. A. Dubniczky, R. Toth, B. Borsos, B. Cherif, [95] Q. Long, Z. Li, R. Gong, Y. N. Wu, D. Terzopoulos, and X. Gao,
R. Jain, L. Muzsai, M. A. Ferrag, R. Marinelli et al., “Dynamic “Teamcraft: A benchmark for multi-modal multi-agent systems in
intelligence assessment: Benchmarking llms on the road to agi with minecraft,” arXiv preprint arXiv:2412.05255, 2024.
a focus on model confidence,” in 2024 IEEE International Conference [96] M. Andriushchenko, A. Souly, M. Dziemian, D. Duenas, M. Lin,
on Big Data (BigData). IEEE, 2024, pp. 3313–3321. J. Wang, D. Hendrycks, A. Zou, Z. Kolter, M. Fredrikson et al.,
[75] N. Tihanyi, M. A. Ferrag, R. Jain, T. Bisztray, and M. Debbah, “Agentharm: A benchmark for measuring harmfulness of llm agents,”
“Cybermetric: a benchmark dataset based on retrieval-augmented gen- arXiv preprint arXiv:2410.09024, 2024.
eration for evaluating llms in cybersecurity knowledge,” in 2024 IEEE [97] H. Li, J. Chen, J. Yang, Q. Ai, W. Jia, Y. Liu, K. Lin, Y. Wu, G. Yuan,
International Conference on Cyber Security and Resilience (CSR). Y. Hu et al., “Legalagentbench: Evaluating llm agents in legal domain,”
IEEE, 2024, pp. 296–302. arXiv preprint arXiv:2412.17259, 2024.
[98] D. Rein, B. L. Hou, A. C. Stickland, J. Petty, R. Y. Pang, J. Dirani, [120] M. S. Rashid, C. Bock, Y. Zhuang, A. Buccholz, T. Esler, S. Valentin,
J. Michael, and S. R. Bowman, “Gpqa: A graduate-level google-proof L. Franceschi, M. Wistuba, P. T. Sivaprasad, W. J. Kim, A. Deoras,
q&a benchmark,” in First Conference on Language Modeling, 2024. G. Zappella, and L. Callot, “Swe-polybench: A multi-language bench-
[99] X. Tang, D. Shao, J. Sohn, J. Chen, J. Zhang, J. Xiang, F. Wu, Y. Zhao, mark for repository level evaluation of coding agents,” 2025.
C. Wu, W. Shi et al., “Medagentsbench: Benchmarking thinking models [121] D. Zan, Z. Huang, W. Liu, H. Chen, L. Zhang, S. Xin, L. Chen, Q. Liu,
and agent frameworks for complex medical reasoning,” arXiv preprint X. Zhong, A. Li et al., “Multi-swe-bench: A multilingual benchmark
arXiv:2503.07459, 2025. for issue resolving,” arXiv preprint arXiv:2504.02605, 2025.
[100] Z. Cheng, Y. Tu, R. Li, S. Dai, J. Hu, S. Hu, J. Li, Y. Shi, T. Yu, [122] A. Srivastava, A. Rastogi, A. Rao, A. A. M. Shoeb, A. Abid, A. Fisch,
W. Chen et al., “Embodiedeval: Evaluate multimodal llms as embodied A. R. Brown, A. Santoro, A. Gupta, A. Garriga-Alonso et al., “Beyond
agents,” arXiv preprint arXiv:2501.11858, 2025. the imitation game: Quantifying and extrapolating the capabilities of
[101] Z. Huang, Z. Wang, S. Xia, X. Li, H. Zou, R. Xu, R.-Z. Fan, language models,” arXiv preprint arXiv:2206.04615, 2022.
L. Ye, E. Chern, Y. Ye et al., “Olympicarena: Benchmarking multi- [123] M. Suzgun, N. Scales, N. Schärli, S. Gehrmann, Y. Tay, H. W. Chung,
discipline cognitive reasoning for superintelligent ai,” Advances in A. Chowdhery, Q. V. Le, E. H. Chi, D. Zhou et al., “Challenging
Neural Information Processing Systems, vol. 37, pp. 19209–19253, big-bench tasks and whether chain-of-thought can solve them,” arXiv
2024. preprint arXiv:2210.09261, 2022.
[102] Y. Xiang, H. Yan, S. Ouyang, L. Gui, and Y. He, “Scireplicate- [124] Langchain agents tutorial. Accessed: February 23, 2025. [Online].
bench: Benchmarking llms in agent-driven algorithmic reproduction Available: https://python.langchain.com/docs/tutorials/agents/
from research papers,” arXiv preprint arXiv:2504.00255, 2025. [125] Building a basic agent. Accessed: February 23, 2025. [Online].
[103] S. Fish, J. Shephard, M. Li, R. I. Shorrer, and Y. A. Gonczarowski, Available: https://docs.llamaindex.ai/en/stable/understanding/agent/
“Econevals: Benchmarks and litmus tests for llm agents in unknown [126] Crewai. Accessed: February 23, 2025. [Online]. Available: https:
environments,” arXiv preprint arXiv:2503.18825, 2025. //www.crewai.com/
[104] Y. Y. Sung, H. Kim, and D. Zhang, “Verila: A human-centered eval- [127] OpenAI, “swarm,” 2024, accessed: 2025-02-03. [Online]. Available:
uation framework for interpretable verification of llm agent failures,” https://github.com/openai/swarm/tree/main
arXiv preprint arXiv:2503.12651, 2025. [128] S. Hu, M. Ouyang, D. Gao, and M. Z. Shou, “The dawn of gui agent:
[105] Y. Yang, B. Huang, S. Qi, C. Feng, H. Hu, Y. Zhu, J. Hu, H. Zhao, A preliminary case study with claude 3.5 computer use,” arXiv preprint
Z. He, X. Liu et al., “Who’s the mvp? a game-theoretic evaluation arXiv:2411.10323, 2024.
benchmark for modular attribution in llm agents,” arXiv preprint [129] J. Wu, J. Zhu, and Y. Liu, “Agentic reasoning: Reasoning llms with
arXiv:2502.00510, 2025. tools for the deep research,” arXiv preprint arXiv:2502.04644, 2025.
[106] Z. Li, S. Huang, J. Wang, N. Zhang, A. Antoniades, W. Hua, K. Zhu, [130] P. Lu, B. Chen, S. Liu, R. Thapa, J. Boen, and J. Zou, “Octotools: An
S. Zeng, W. Y. Wang, and X. Yan, “Agentorca: A dual-system frame- agentic framework with extensible tools for complex reasoning,” arXiv
work to evaluate language agents on operational routine and constraint preprint arXiv:2502.11271, 2025.
adherence,” arXiv preprint arXiv:2503.08669, 2025.
[131] OpenAI, “Agents sdk,” https://platform.openai.com/docs/guides/
[107] K. Liu, Y. Pan, J. Li, D. He, Y. Xiang, Y. Du, and T. Gao, “Projecteval:
agents-sdk, accessed: March 18, 2025.
A benchmark for programming agents automated evaluation on project-
[132] C. Wang, X. Hu, Y. Zhang, X. Chen, P. Du, Y. Mao, R. Wang,
level code generation,” arXiv preprint arXiv:2503.07010, 2025.
Y. Li, Y. Wu, H. Yang et al., “Starwhisper telescope: Agent-based
[108] D. Gautam, S. Garg, J. Jang, N. Sundaresan, and R. Z. Moghad-
observation assistant system to approach ai astrophysicist,” arXiv
dam, “Refactorbench: Evaluating stateful reasoning in language agents
preprint arXiv:2412.06412, 2024.
through code,” arXiv preprint arXiv:2503.07832, 2025.
[109] Y. Song, K. Thai, C. M. Pham, Y. Chang, M. Nadaf, and M. Iyyer, [133] H. Zhang, Y. Song, Z. Hou, S. Miret, and B. Liu, “Honeycomb: A
“Bearcubs: A benchmark for computer-using web agents,” arXiv flexible llm-based agent system for materials science,” arXiv preprint
preprint arXiv:2503.07919, 2025. arXiv:2409.00135, 2024.
[110] G. Gonzalez-Pumariega, L. S. Yean, N. Sunkara, and S. Choudhury, [134] Z. Wang, Q. Jin, C.-H. Wei, S. Tian, P.-T. Lai, Q. Zhu, C.-P. Day,
“Robotouille: An asynchronous planning benchmark for llm agents,” C. Ross, and Z. Lu, “Geneagent: self-verification language agent for
arXiv preprint arXiv:2502.05227, 2025. gene set knowledge discovery using domain databases,” arXiv preprint
[111] W. Tang, Y. Zhou, E. Xu, K. Cheng, M. Li, and L. Xiao, “Dsg- arXiv:2405.16205, 2024.
bench: A diverse strategic game benchmark for evaluating llm-based [135] M. J. Buehler, “Preflexor: Preference-based recursive language mod-
agents in complex decision-making environments,” arXiv preprint eling for exploratory optimization of reasoning and agentic thinking,”
arXiv:2503.06047, 2025. arXiv preprint arXiv:2410.12375, 2024.
[112] M. Ku, T. Chong, J. Leung, K. Shah, A. Yu, and W. Chen, “The- [136] X. Liang, J. Yang, Y. Wang, C. Tang, Z. Zheng, S. Niu, S. Song,
oremexplainagent: Towards multimodal explanations for llm theorem H. Wang, B. Tang, F. Xiong et al., “Surveyx: Academic survey au-
understanding,” arXiv preprint arXiv:2502.19400, 2025. tomation via large language models,” arXiv preprint arXiv:2502.14776,
[113] J. Yan, Y. Luo, and Y. Zhang, “Refutebench 2.0–agentic benchmark for 2025.
dynamic evaluation of llm responses to refutation instruction,” arXiv [137] L. Li, W. Xu, J. Guo, R. Zhao, X. Li, Y. Yuan, B. Zhang,
preprint arXiv:2502.18308, 2025. Y. Jiang, Y. Xin, R. Dang et al., “Chain of ideas: Revolutionizing
[114] D. Nathani, L. Madaan, N. Roberts, N. Bashlykov, A. Menon, research via novel idea development with llm agents,” arXiv preprint
V. Moens, A. Budhiraja, D. Magka, V. Vorotilov, G. Chaurasia et al., arXiv:2410.13185, 2024.
“Mlgym: A new framework and benchmark for advancing ai research [138] A. Mitra, L. Del Corro, G. Zheng, S. Mahajan, D. Rouhana, A. Co-
agents,” arXiv preprint arXiv:2502.14499, 2025. das, Y. Lu, W.-g. Chen, O. Vrousgos, C. Rosset et al., “Agentin-
[115] D. Zhang, S. Zhoubian, M. Cai, F. Li, L. Yang, W. Wang, T. Dong, struct: Toward generative teaching with agentic flows,” arXiv preprint
Z. Hu, J. Tang, and Y. Yue, “Datascibench: An llm agent benchmark arXiv:2407.03502, 2024.
for data science,” arXiv preprint arXiv:2502.13897, 2025. [139] X. Tang, T. Hu, M. Ye, Y. Shao, X. Yin, S. Ouyang, W. Zhou,
[116] R. Yang, H. Chen, J. Zhang, M. Zhao, C. Qian, K. Wang, Q. Wang, P. Lu, Z. Zhang, Y. Zhao et al., “Chemagent: Self-updating library in
T. V. Koripella, M. Movahedi, M. Li et al., “Embodiedbench: Compre- large language models improves chemical reasoning,” arXiv preprint
hensive benchmarking multi-modal large language models for vision- arXiv:2501.06590, 2025.
driven embodied agents,” arXiv preprint arXiv:2502.09560, 2025. [140] B. Liu, J. Zhang, F. Lin, X. Jia, and M. Peng, “\textit {One Size doesn’t
[117] J. Wei, Z. Sun, S. Papay, S. McKinney, J. Han, I. Fulford, H. W. Chung, [140] B. Liu, J. Zhang, F. Lin, X. Jia, and M. Peng, “One size doesn’t
A. Tachard Passos, W. Fedus, and A. Glaese, “Browsecomp: A simple fit all: A personalized conversational tutoring agent for mathematics
yet challenging benchmark for browsing agents,” https://cdn.openai. [141] Z. Zhang, X. Bo, C. Ma, R. Li, X. Chen, Q. Dai, J. Zhu, Z. Dong,
com/pdf/5e10f4ab-d6f7-442e-9508-59515c65e35d/browsecomp.pdf, and J.-R. Wen, “A survey on the memory mechanism of large language
2025, accessed: 2025-04-13. model based agents,” arXiv preprint arXiv:2404.13501, 2024.
[118] A. Backlund and L. Petersson, “Vending-bench: A benchmark [142] P. Zhao, H. Zhang, Q. Yu, Z. Wang, Y. Geng, F. Fu, L. Yang, W. Zhang,
for long-term coherence of autonomous agents,” arXiv preprint J. Jiang, and B. Cui, “Retrieval-augmented generation for ai-generated
arXiv:2502.15840, 2025. content: A survey,” arXiv preprint arXiv:2402.19473, 2024.
[119] J. S. Chan, N. Chowdhury, O. Jaffe, J. Aung, D. Sherburn, E. Mays, [143] Y. Liu, S. K. Lo, Q. Lu, L. Zhu, D. Zhao, X. Xu, S. Harrer, and
G. Starace, K. Liu, L. Maksin, T. Patwardhan et al., “Mle-bench: J. Whittle, “Agent design pattern catalogue: A collection of architec-
Evaluating machine learning agents on machine learning engineering,” tural patterns for foundation model based agents,” Journal of Systems
arXiv preprint arXiv:2410.07095, 2025. and Software, vol. 220, p. 112278, 2025.
[144] “How to design an agent for production,” ac- [167] S. Hong, Y. Lin, B. Liu, B. Liu, B. Wu, C. Zhang, C. Wei, D. Li,
cessed: 2025-04-14. [Online]. Available: https://blog.langchain.dev/ J. Chen, J. Zhang et al., “Data interpreter: An llm agent for data
how-to-design-an-agent-for-production/ science,” arXiv preprint arXiv:2402.18679, 2024.
[145] Q. Sun, K. Cheng, Z. Ding, C. Jin, Y. Wang, F. Xu, Z. Wu, [168] J. Gottweis, W.-H. Weng, A. Daryin, T. Tu, A. Palepu, P. Sirkovic,
C. Jia, L. Chen, Z. Liu et al., “Os-genesis: Automating gui agent A. Myaskovsky, F. Weissenberger, K. Rong, R. Tanno et al., “Towards
trajectory construction via reverse task synthesis,” arXiv preprint an ai co-scientist,” arXiv preprint arXiv:2502.18864, 2025.
arXiv:2412.19723, 2024. [169] W. Dong, “The ann arbor architecture for agent-oriented programming,”
[146] J. Chen, C. Gui, A. Gao, K. Ji, X. Wang, X. Wan, and B. Wang, “Cod, arXiv preprint arXiv:2502.09903, 2025.
towards an interpretable medical agent using chain of diagnosis,” arXiv [170] N. Jain, J. Singh, M. Shetty, L. Zheng, K. Sen, and I. Stoica, “R2e-gym:
preprint arXiv:2407.13301, 2024. Procedural environments and hybrid verifiers for scaling open-weights
[147] Y. Zhou, P. Zhang, M. Song, A. Zheng, Y. Lu, Z. Liu, Y. Chen, and swe agents,” arXiv preprint arXiv:2504.07164, 2025.
Z. Xi, “Zodiac: A cardiologist-level llm framework for multi-agent [171] J. Wang, Y. Dai, Y. Zhang, Z. Ma, W. Li, and J. Chai, “Training turn-
diagnostics,” arXiv preprint arXiv:2410.02026, 2024. by-turn verifiers for dialogue tutoring agents: The curious case of llms
[148] Z. Wang, J. Wu, C. H. Low, and Y. Jin, “Medagent-pro: Towards as your coding tutors,” arXiv preprint arXiv:2502.13311, 2025.
multi-modal evidence-based medical diagnosis via reasoning agentic [172] H.-Y. Chen, C.-P. Huang, and J.-M. Yao, “Verbal process supervision
workflow,” arXiv preprint arXiv:2503.18968, 2025. elicits better coding agents,” arXiv preprint arXiv:2503.18494, 2025.
[149] I. Steenstra, F. Nouraei, and T. W. Bickmore, “Scaffolding empathy: [218] G. Penedo, H. Kydlíček, V. Sabolčec, B. Messmer, N. Foroutan,
Training counselors with simulated patients and utterance-level perfor- Dynamic action re-sampling to enhance coding agent performance by
mance visualizations,” arXiv preprint arXiv:2502.18673, 2025. adaptive tree traversal,” arXiv preprint arXiv:2503.14269, 2025.
[150] J. Feng, Q. Zheng, C. Wu, Z. Zhao, Y. Zhang, Y. Wang, and W. Xie, [174] Z. Chen, X. Tang, G. Deng, F. Wu, J. Wu, Z. Jiang, V. Prasanna,
“M^3Builder: A multi-agent system for automated machine learning arXiv preprint arXiv:2412.21139, 2024.
in medical imaging,” arXiv preprint arXiv:2502.20301, 2025. localization,” arXiv preprint arXiv:2503.09089, 2025.
[151] D. Rose, C.-C. Hung, M. Lepri, I. Alqassem, K. Gashteovski, [175] A. Gholamzadeh Khoee, S. Wang, Y. Yu, R. Feldt, and
and C. Lawrence, “Meddxagent: A unified modular agent frame- D. Parthasarathy, “Gatelens: A reasoning-enhanced llm agent for
work for explainable automatic differential diagnosis,” arXiv preprint automotive software release analytics,” arXiv e-prints, pp. arXiv–
arXiv:2502.19175, 2025. 2503, 2025.
[152] F. Ghezloo, M. S. Seyfioglu, R. Soraki, W. O. Ikezogwo, B. Li, [176] R. Hu, C. Peng, X. Wang, and C. Gao, “An llm-based agent for reliable
T. Vivekanandan, J. G. Elmore, R. Krishna, and L. Shapiro, “Pathfinder: docker environment configuration,” arXiv preprint arXiv:2502.13681,
A multi-modal multi-agent system for medical diagnostic decision- 2025.
making applied to histopathology,” arXiv preprint arXiv:2502.08916, [177] Y. Lu, B. Yao, H. Gu, J. Huang, J. Wang, L. Li, J. Gesi, Q. He, T. J.-
2025. J. Li, and D. Wang, “Uxagent: An llm agent-based usability testing
[153] M. A. Abbasi, F. S. Mirnezami, and H. Naderi, “Hamraz: A culture- framework for web design,” arXiv preprint arXiv:2502.12561, 2025.
based persian conversation dataset for person-centered therapy using
[178] J. Pan, X. Wang, G. Neubig, N. Jaitly, H. Ji, A. Suhr, and Y. Zhang,
llm agents,” arXiv preprint arXiv:2502.05982, 2025.
“Training software engineering agents and verifiers with swe-gym,”
[154] Y. Yang, P. Achananuparp, H. Huang, J. Jiang, K. P. Leng, N. G. Lim,
arXiv preprint arXiv:2412.21139, 2024.
C. T. S. Ern, and E.-p. Lim, “Cami: A counselor agent supporting mo-
[179] J. Yang, W. Zhang, J. Yang, Y. Miao, S. Quan, Z. Wu, Q. Peng, L. Yang,
tivational interviewing through state inference and topic exploration,”
T. Liu, Z. Cui et al., “Multi-agent collaboration for multilingual code
arXiv preprint arXiv:2502.02807, 2025.
instruction tuning,” arXiv preprint arXiv:2502.07487, 2025.
[155] A. Xu, D. Yang, R. Li, J. Zhu, M. Tan, M. Yang, W. Qiu, M. Ma,
H. Wu, B. Li et al., “Autocbt: An autonomous multi-agent framework [180] X. Guo, X. Wang, Y. Chen, S. Li, C. Han, M. Li, and H. Ji,
for cognitive behavioral therapy in psychological counseling,” arXiv “Syncmind: Measuring agent out-of-sync recovery in collaborative
preprint arXiv:2501.09426, 2025. software engineering,” arXiv preprint arXiv:2502.06994, 2025.
[156] J. Lee, K. Lim, Y.-C. Jung, and B.-H. Kim, “Psyche: A multi-faceted [181] M. A. Islam, M. E. Ali, and M. R. Parvez, “Codesim: Multi-agent code
patient simulation framework for evaluation of psychiatric assessment generation and problem solving through simulation-driven planning and
conversational agents,” arXiv preprint arXiv:2501.01594, 2025. debugging,” arXiv preprint arXiv:2502.05664, 2025.
[157] Y. Zhang, X. Yang, X. Li, S. Yu, Y. Luan, S. Feng, D. Wang, [182] X. Wan, H. Deng, K. Zou, and S. Xu, “Enhancing the efficiency and
and Y. Zhang, “Psydraw: A multi-agent multimodal system for accuracy of underlying asset reviews in structured finance: The appli-
mental health screening in left-behind children,” arXiv preprint cation of multi-agent framework,” arXiv preprint arXiv:2405.04294,
arXiv:2412.14769, 2024. 2024.
[158] Z. Du, L. Zheng, R. Hu, Y. Xu, X. Li, Y. Sun, W. Chen, J. Wu, [183] Y. Yang, Y. Zhang, M. Wu, K. Zhang, Y. Zhang, H. Yu, Y. Hu, and
H. Cai, and H. Ying, “Llms can simulate standardized patients via B. Wang, “Twinmarket: A scalable behavioral and socialsimulation for
agent coevolution,” arXiv preprint arXiv:2412.11716, 2024. financial markets,” arXiv preprint arXiv:2502.01506, 2025.
[159] R. Wasenmüller, K. Hilbert, and C. Benzmüller, “Script-based dialog [184] Y. Yu, Z. Yao, H. Li, Z. Deng, Y. Jiang, Y. Cao, Z. Chen, J. Suchow,
policy planning for llm-powered conversational agents: A basic archi- Z. Cui, R. Liu et al., “Fincon: A synthesized llm multi-agent system
tecture for an “ai therapist”,” arXiv preprint arXiv:2412.15242, 2024. with conceptual verbal reinforcement for enhanced financial decision
[160] R. Averly, F. N. Baker, and X. Ning, “Liddia: Language-based intelli- making,” Advances in Neural Information Processing Systems, vol. 37,
gent drug discovery agent,” arXiv preprint arXiv:2502.13959, 2025. pp. 137010–137045, 2024.
[161] X. Wang, Y. Zhang, X. Zhang, L. Yu, X. Lin, J. Jiang, B. Ma, and [185] R. Y. Lin, S. Ojha, K. Cai, and M. F. Chen, “Strategic collusion of
K. Yu, “Patentagent: Intelligent agent for automated pharmaceutical llm agents: Market division in multi-commodity competitions,” arXiv
patent analysis,” arXiv preprint arXiv:2410.21312, 2024. preprint arXiv:2410.00031, 2024.
[162] Y. Inoue, T. Song, and T. Fu, “Drugagent: Explainable drug repurposing [186] S. Fatemi and Y. Hu, “Enhancing financial question answering with
agent with large language model-based reasoning,” arXiv preprint a multi-agent reflection framework,” in Proceedings of the 5th ACM
arXiv:2408.13378, 2024. International Conference on AI in Finance, 2024, pp. 530–537.
[163] Z. Chen, Z. Peng, X. Liang, C. Wang, P. Liang, L. Zeng, [187] X. Han, N. Wang, S. Che, H. Yang, K. Zhang, and S. X. Xu, “Enhanc-
M. Ju, and Y. Yuan, “Map: Evaluation and multi-agent enhancement ing investment analysis: Optimizing ai-agent collaboration in financial
of large language models for inpatient pathways,” arXiv preprint research,” in Proceedings of the 5th ACM International Conference on
arXiv:2503.13205, 2025. AI in Finance, 2024, pp. 538–546.
[164] T. Yun, E. Yang, M. Safdari, J. H. Lee, V. V. Kumar, S. S. Mahdavi, [188] S. Han, C. Zhou, Y. Shen, T. Sun, Y. Zhou, X. Wang, Z. Yang, J. Zhang,
J. Amar, D. Peyton, R. Aharony, A. Michaelides et al., “Sleepless and H. Li, “Finsphere: A conversational stock analysis agent equipped
nights, sugary days: Creating synthetic users with health conditions for with quantitative tools based on real-time database,” arXiv preprint
realistic coaching agent interactions,” arXiv preprint arXiv:2502.13135, arXiv:2501.12399, 2025.
2025. [189] G. Fatouros, K. Metaxas, J. Soldatos, and M. Karathanassis, “Mar-
[165] X. Lin, S. Ma, J. Shan, X. Zhang, S. X. Hu, T. Guo, S. Z. Li, and ketsenseai 2.0: Enhancing stock analysis through llm agents,” arXiv
K. Yu, “Biokgbench: A knowledge graph checking benchmark of ai preprint arXiv:2502.00415, 2025.
agent for biomedical science,” arXiv preprint arXiv:2407.00466, 2024. [190] I. Okpala, A. Golgoon, and A. R. Kannan, “Agentic ai systems applied
[166] S. Schmidgall and M. Moor, “Agentrxiv: Towards collaborative au- to tasks in financial services: Modeling and model risk management
tonomous research,” arXiv preprint arXiv:2503.18102, 2025. crews,” arXiv preprint arXiv:2502.05439, 2025.
[191] J. Zeng, H. Liu, Z. Dai, X. Tang, C. Luo, S. Varshney, Z. Li, [214] “Beeai now has multiple agents, and a standardized way for
and Q. He, “Cite before you speak: Enhancing context-response them to talk,” accessed: 2025-04-14. [Online]. Available: https:
grounding in e-commerce conversational llm-agents,” arXiv preprint //research.ibm.com/blog/multiagent-bee-ai
arXiv:2503.04830, 2025. [215] “A2A: A New Era of Agent Interoperability,” accessed: 2025-
[192] H. Cho, D. Kim, S. Yang, C. Lee, H. Lee, and J. Choo, “Building 04-14. [Online]. Available: https://developers.googleblog.com/en/
resource-constrained language agents: A korean case study on chemical a2a-a-new-era-of-agent-interoperability/
toxicity information,” arXiv preprint arXiv:2503.17753, 2025. [216] X. Hou, Y. Zhao, S. Wang, and H. Wang, “Model context protocol
[193] S. Kumbhar, V. Mishra, K. Coutinho, D. Handa, A. Iquebal, and (mcp): Landscape, security threats, and future research directions,”
C. Baral, “Hypothesis generation for materials discovery and design arXiv preprint arXiv:2503.23278, 2025.
using goal-driven and constraint-guided llm agents,” arXiv preprint [217] W. Yuan, J. Yu, S. Jiang, K. Padthe, Y. Li, D. Wang, I. Kulikov, K. Cho,
arXiv:2501.13299, 2025. Y. Tian, J. E. Weston et al., “Naturalreasoning: Reasoning in the wild
[194] B. Lei, Y. Zhang, S. Zuo, A. Payani, and C. Ding, “Macm: Utilizing with 2.8 m challenging questions,” arXiv preprint arXiv:2502.13124,
a multi-agent system for condition mining in solving complex mathe- 2025.
matical problems,” arXiv preprint arXiv:2404.04735, 2024. [218] G. Penedo, H. Kydlı́ček, V. Sabolčec, B. Messmer, N. Foroutan,
[195] W. Xie, D. Liu, H. Yan, W. Wu, and Z. Liu, “Mathlearner: A large M. Jaggi, L. von Werra, and T. Wolf, “Fineweb2: A sparkling
language model agent framework for learning to solve mathematical update with 1000s of languages,” Dec. 2024. [Online]. Available:
problems,” arXiv preprint arXiv:2408.01779, 2024. https://huggingface.co/datasets/HuggingFaceFW/fineweb-2
[196] G. Lee, S. Park, J. Park, A. Chung, S. Park, Y. Park, B. Kim, and [219] Argilla, “Magpie ultra v0.1 [dataset],” https://huggingface.co/datasets/
M.-g. Cho, “Expanding search space with diverse prompting agents: argilla/magpie-ultra-v0.1, 2024, accessed: February 16, 2025.
An efficient sampling approach for llm mathematical reasoning,” arXiv [220] C. Costello, S. Guo, A. Goldie, and A. Mirhoseini, “Think, prune,
preprint arXiv:2410.09780, 2024. train, improve: Scaling reasoning without scaling models,” 2025.
[197] Y. Deng and P. Mineiro, “Flow-dpo: Improving llm mathemati- [Online]. Available: https://arxiv.org/abs/2504.18116
cal reasoning through online multi-agent learning,” arXiv preprint [221] V. Xiang, C. Snell, K. Gandhi, A. Albalak, A. Singh, C. Blagden,
arXiv:2410.22304, 2024. D. Phung, R. Rafailov, N. Lile, D. Mahan et al., “Towards system 2
[198] V. Li, Y. Fu, T. Knappe, K. Han, and K. Zhu, “Automating mathemati- reasoning in llms: Learning how to think with meta chain-of-though,”
cal proof generation using large language model agents and knowledge arXiv preprint arXiv:2501.04682, 2025.
graphs,” arXiv preprint arXiv:2503.11657, 2025. [222] M. Z. Pan, M. Cemri, L. A. Agrawal, S. Yang, B. Chopra, R. Tiwari,
[199] R. Wang, R. Pan, Y. Li, J. Zhang, Y. Jia, S. Diao, R. Pi, K. Keutzer, A. Parameswaran, K. Ramchandran, D. Klein et al., “Why
J. Hu, and T. Zhang, “Ma-lot: Multi-agent lean-based long chain-of- do multiagent systems fail?” in ICLR 2025 Workshop on Building Trust
thought reasoning enhances formal theorem proving,” arXiv preprint in Language Models and Applications.
arXiv:2503.03205, 2025. [223] Y. Liu, Z. Yang, T. Xie, J. Ni, B. Gao, Y. Li, S. Tang, W. Ouyang,
[200] M. Yue, W. Lyu, W. Mifdal, J. Suh, Y. Zhang, and Z. Yao, “Mathvc: E. Cambria, and D. Zhou, “Researchbench: Benchmarking llms in
An llm-simulated multi-character virtual classroom for mathematics scientific discovery via inspiration-based task decomposition,” 2025.
education,” arXiv preprint arXiv:2404.06711, 2024. [Online]. Available: https://arxiv.org/abs/2503.21248
[201] B. Liu, J. Zhang, F. Lin, X. Jia, and M. Peng, “One size doesn’t fit all: A [224] M. Wu, T. Zhu, H. Han, X. Zhang, W. Shao, and W. Chen, “Chain-
personalized conversational tutoring agent for mathematics instruction,” of-tools: Utilizing massive unseen tools in the cot reasoning of frozen
2025. [Online]. Available: https://arxiv.org/abs/2502.12633 language models,” arXiv preprint arXiv:2503.16779, 2025.
[202] T. Ma, J. Du, W. Huang, W. Wang, L. Xie, X. Zhong, and J. T. Zhou, [225] M. Chen, T. Li, H. Sun, Y. Zhou, C. Zhu, F. Yang, Z. Zhou, W. Chen,
“Llm knows geometry better than algebra: Numerical understanding of H. Wang, J. Z. Pan et al., “Learning to reason with search for llms via
llm-based agents in a trading arena,” arXiv preprint arXiv:2502.17967, reinforcement learning,” arXiv preprint arXiv:2503.19470, 2025.
2025.
[203] B. Yu, T. Shen, H. Na, L. Chen, and D. Li, “Mineagent: Towards
remote-sensing mineral exploration with multimodal large language
models,” arXiv preprint arXiv:2412.17339, 2024.
[204] H. Ning, Z. Li, T. Akinboyewa, and M. N. Lessani, “An autonomous gis
agent framework for geospatial data retrieval,” International Journal of
Digital Earth, vol. 18, no. 1, p. 2458688, 2025.
[205] Z. Xu, L. Wang, J. Wang, Z. Li, S. Shi, X. Yang, Y. Wang, B. Hu, J. Yu,
and M. Zhang, “Filmagent: A multi-agent framework for end-to-end
film automation in virtual 3d spaces,” arXiv preprint arXiv:2501.12909,
2025.
[206] J. Wang, Z. Du, Y. Zhao, B. Yuan, K. Wang, J. Liang, Y. Zhao, Y. Lu,
G. Li, J. Gao et al., “Aesopagent: Agent-driven evolutionary system on
story-to-video production,” arXiv preprint arXiv:2403.07952, 2024.
[207] S. Han, L. Chen, L.-M. Lin, Z. Xu, and K. Yu, “Ibsen: Director-
actor agent collaboration for controllable and interactive drama script
generation,” arXiv preprint arXiv:2407.01093, 2024.
[208] A. Maronikolakis, A. P. Ramallo, W. Cheng, and T. Kober, “What
should i wear to a party in a greek taverna? evaluation for conversa-
tional agents in the fashion domain,” arXiv preprint arXiv:2408.08907,
2024.
[209] Q. Deng, Q. Yang, R. Yuan, Y. Huang, Y. Wang, X. Liu, Z. Tian,
J. Pan, G. Zhang, H. Lin et al., “Composerx: Multi-agent symbolic
music composition with llms,” arXiv preprint arXiv:2404.18081, 2024.
[210] D. Yu, K. Song, P. Lu, T. He, X. Tan, W. Ye, S. Zhang, and J. Bian,
“Musicagent: An ai agent for music understanding and generation with
large language models,” arXiv preprint arXiv:2310.11954, 2023.
[211] R. Zhang and S. Eger, “Llm-based multi-agent poetry generation
in non-cooperative environments,” arXiv preprint arXiv:2409.03659,
2024.
[212] H.-H. Liu and Y.-W. Liu, “Agent-driven large language models for
mandarin lyric generation,” in 2024 27th Conference of the Orien-
tal COCOSDA International Committee for the Co-ordination and
Standardisation of Speech Databases and Assessment Techniques (O-
COCOSDA). IEEE, 2024, pp. 1–6.
[213] “Introduction to mcp,” accessed: 2025-04-14. [Online]. Available:
https://modelcontextprotocol.io/introduction