0% found this document useful (0 votes)

226 views44 pages

Autonomous AI Agents

This document reviews the evolution of large language models (LLMs) and autonomous AI agents, presenting a comprehensive comparison of benchmarks and a proposed taxonomy for evaluating their performance across various domains. It discusses the integration of LLMs with frameworks for autonomous decision-making, highlights real-world applications, and addresses challenges such as reliability and ethical governance. The paper also outlines future research directions to enhance the capabilities and evaluation of these AI systems.

Uploaded by

Oyoy Taoufik

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

226 views44 pages

Autonomous AI Agents

Uploaded by

Oyoy Taoufik

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 44

1

From LLM Reasoning to Autonomous AI Agents:

A Comprehensive Review
Mohamed Amine Ferrag∗¶ , Norbert Tihanyi†‡ , and Merouane Debbah§
∗ Guelma University, Algeria
† Technology Innovation Institute, UAE
‡ Eötvös Loránd University, Hungary
§ Khalifa University of Science and Technology, UAE
¶ Corresponding author: ferrag.mohamedamine@univ-guelma.dz
arXiv:2504.19678v1 [cs.AI] 28 Apr 2025

Abstract—Large language models and autonomous AI agents employing reflection, planning, and multi-agent collaboration
have evolved rapidly, resulting in a diverse array of evaluation has given rise to Agentic RAG systems, which dynamically
benchmarks, frameworks and collaboration protocols. However, orchestrate information retrieval and iterative refinement to
the landscape remains fragmented and lacks a unified taxonomy
or comprehensive survey. Therefore, we present a side-by-side manage complex workflows effectively [11], [12].
comparison of benchmarks developed between 2019 and 2025 Recent advances in large language models have paved the
that evaluate these models and agents across multiple domains. way for highly autonomous AI systems that can independently
In addition, we propose a taxonomy of approximately 60 bench- handle complex research tasks. These systems, often referred
marks that cover general and academic knowledge reasoning, to as agentic AI, can generate hypotheses, conduct literature
mathematical problem-solving, code generation and software
engineering, factual grounding and retrieval, domain-specific reviews, design experiments, analyze data, accelerate scientific
evaluations, multimodal and embodied tasks, task orchestration, discovery, and reduce research costs [13], [14], [15], [16].
and interactive assessments. Furthermore, we review AI-agent Several frameworks, such as LitSearch, ResearchArena, and
frameworks introduced between 2023 and 2025 that integrate Agent Laboratory, have been developed to automate various
large language models with modular toolkits to enable au- research tasks, including citation management and academic
tonomous decision-making and multi-step reasoning. Moreover,
we present real-world applications of autonomous AI agents survey generation [17], [18], [19]. However, challenges per-
in materials science, biomedical research, academic ideation, sist, especially in executing domain-specific literature reviews
software engineering, synthetic data generation, chemical rea- and ensuring the reproducibility and reliability of automated
soning, mathematical problem-solving, geographic information processes [20], [21]. Parallel to these developments in research
systems, multimedia, healthcare, and finance. We then survey automation, large language model-based agents have also
key agent-to-agent collaboration protocols, namely the Agent
Communication Protocol (ACP), the Model Context Protocol begun to transform the medical field [22]. These agents are in-
(MCP), and the Agent-to-Agent Protocol (A2A). Finally, we creasingly used for diagnostic support, patient communication,
discuss recommendations for future research, focusing on ad- and medical education by integrating clinical guidelines, med-
vanced reasoning strategies, failure modes in multi-agent LLM ical knowledge bases, and healthcare systems. Despite their
systems, automated scientific discovery, dynamic tool integration promise, these applications face significant hurdles, including
via reinforcement learning, integrated search capabilities, and
security vulnerabilities in agent protocols. concerns over reliability, reproducibility, ethical governance,
and safety [23], [24], [25]. Addressing these issues is crucial
Index Terms—Large Language Models, Autonomous AI for ensuring that LLM-based agents can be effectively and
Agents, Agentic AI, Reasoning, Benchmarks.
responsibly incorporated into clinical practice, underscoring
the need for comprehensive evaluation frameworks that can
I. I NTRODUCTION reliably measure their performance across various healthcare
Large Language Models (LLMs) such as OpenAI’s GPT- tasks [26], [27], [28].
4 [1], Qwen2.5-Omni [2], DeepSeek-R1 [3], and Meta’s LLM-based agents are emerging as a promising frontier in
LLaMA [4] have transformed AI by enabling human-like text AI, combining reasoning and action to interact with complex
generation and advanced natural language processing, spurring digital environments [29], [30]. Therefore, various approaches
innovation in conversational agents, automated content cre- have been explored to enhance LLM-based agents, from
ation, and real-time translation [5]. Recent enhancements have combining reasoning and acting using techniques like React
extended their utility to multimodal tasks, including text-to- [31] and Monte Carlo Tree Search [32] to synthesizing high-
image and text-to-video generation that broaden the scope quality data with methods like Learn-by-Interact [33], which
of generative AI applications [6]. However, their dependence sidestep assumptions such as state reversals. Other strategies
on static pre-training data can lead to outdated outputs and involve training on human-labeled or GPT-4 distilled data with
hallucinated responses [7], [8], a limitation that Retrieval- systems like AgentGen [34] and AgentTuning [35] to generate
Augmented Generation (RAG) addresses by incorporating trajectory data. At the same time, reinforcement learning
real-time data from knowledge bases, APIs, or the web [9], methods utilize offline algorithms and iterative refinement
[10]. Building on this, the evolution of intelligent agents through reward models and feedback to enhance efficiency
2

and performance in realistic environments [36], [37]. efforts across multiple domains. In this section, we review
LLM-based Multi-Agents harness the collective intelligence the most relevant studies that investigate the integration of
of multiple specialized agents, enabling advanced capabilities LLM-based agents into software engineering, propose agent
over single-agent systems by simulating complex real-world architectures and evaluation frameworks, explore the devel-
environments through collaborative planning, discussion, and opment of multi-agent systems, and examine domain-specific
decision-making. This approach leverages the communicative applications, including healthcare, game-theoretic scenarios,
strengths and domain-specific expertise of LLMs, allowing GUI interactions, personal assistance, scientific discovery, and
distinct agents to interact effectively, much like human teams chemistry.
tackling problem-solving tasks [38], [39]. Recent research
highlights promising applications across various fields, includ- A. LLM-based Agents in Software Engineering
ing software development [40], [41], multi-robot systems [42],
Wang et al. [47] present a survey that bridges Large Lan-
[43], society simulation [44], policy simulation [45], and game
guage Model (LLM)-based agent technologies with software
simulation [46].
engineering (SE). It highlights how LLMs have achieved
The main contributions of this study are:
significant success in various domains and have been in-
• We present a comparative table of benchmarks developed tegrated into SE tasks, often under the agent paradigm,
between 2019 and 2025 that rigorously evaluate large lan- whether explicitly or implicitly. The study presents a structured
guage models and autonomous AI agents across multiple framework for LLM-based agents in SE, comprising three
domains. primary modules: perception, memory, and action. Jin et al.
• We propose a taxonomy of approximately 60 LLM [48] investigate the use of large language models (LLMs)
and AI-agent benchmarks, including general and aca- and LLM-based agents in software engineering, distinguishing
demic knowledge reasoning, mathematical problem solv- between the traditional capabilities of LLMs and the enhanced
ing, code generation and software engineering, fac- functionalities offered by autonomous agents. It highlights the
tual grounding and retrieval, domain-specific evaluations, significant success of LLMs in tasks such as code genera-
multimodal and embodied tasks, task orchestration, and tion and vulnerability detection, while also addressing their
interactive and agentic assessments. limitations, specifically the issues of autonomy and self-
• We present prominent AI-agent frameworks from 2023 improvement that LLM-based agents aim to overcome. The
to 2025 that integrate large language models with mod- paper provides an extensive review of current practices across
ular toolkits, enabling autonomous decision-making and six key domains: requirement engineering, code generation,
multi-step reasoning. autonomous decision-making, software design, test generation,
• We provide applications of autonomous AI agents in and software maintenance. In a complementary study, Jin et
various fields, including materials science and biomedical al. [48] investigate the use of large language models (LLMs)
research, academic ideation and software engineering, and LLM-based agents in software engineering, distinguishing
synthetic data generation and chemical reasoning, mathe- between the traditional capabilities of LLMs and the enhanced
matical problem-solving and geographic information sys- functionalities offered by autonomous agents. It highlights
tems, as well as multimedia, healthcare, and finance. the significant success of LLMs in tasks such as code gen-
• We survey agent-to-agent collaboration protocols, namely eration and vulnerability detection, while also addressing
the Agent Communication Protocol (ACP), the Model their limitations, specifically, issues of autonomy and self-
Context Protocol (MCP), and the Agent-to-Agent Pro- improvement that LLM-based agents aim to overcome. The
tocol (A2A). paper provides an extensive review of current practices across
• We outline recommendations for future research on au- six key domains: requirement engineering, code generation,
tonomous AI agents, specifically advanced reasoning autonomous decision-making, software design, test generation,
strategies, failure modes in multi-agent large language and software maintenance.
model (LLM) systems, automated scientific discovery,
dynamic tool integration via reinforcement learning, inte-
B. Agent Architectures and Evaluation Frameworks
grated search capabilities, and security vulnerabilities in
agent protocols. Singh et al. [49] delves into Agentic Retrieval-Augmented
Generation (Agentic RAG), a sophisticated evolution of tra-
Fig. 1 illustrates the structure of this survey. Section II
ditional Retrieval-Augmented Generation systems that en-
presents the related works. Section III provides a side-by-
hances the capabilities of large language models (LLMs).
side tabular comparison of state-of-the-art LLM and Agentic
While LLMs have transformed AI through human-like text
AI benchmarks. Section IV reviews AI agent frameworks, AI
generation and language understanding, their dependence on
agent applications, AI agent protocols, and training datasets
static training data often results in outdated or imprecise
across various domains. Section V highlights several critical
responses. The paper addresses these limitations by embed-
research directions. Finally, Section VI concludes the paper.
ding autonomous agents within the RAG framework, enabling
dynamic, real-time data retrieval and adaptive workflows.
II. R ELATED W ORKS It details how agentic design patterns such as reflection,
The growing field of autonomous AI agents powered by planning, tool utilization, and multi-agent collaboration equip
large language models has inspired a wide range of research these systems to manage complex tasks and support multi-step
3

LLM and Agentic AI

Introduction Related Works
Benchmarks

How have recent advancements in LLMs What are the related surveys in the field What are the key LLM benchmarks
and agentic AI impacted autonomous AI of LLM-based agents and autonomous AI developed between 2019 and 2025 for
systems, and what are the main systems? evaluating large language models and
contributions of this study? agentic AI systems across various
domains?
LLM-based Agents in Software
Recent advancements in LLMs Engineering
ComplexFuncBench
MMLU benchmark
benchmark
Agent Architectures and Evaluation
Agentic AI Frameworks Humanity's Last
FACTS Grounding
Exam (HLE)
benchmark
benchmark
Multi-Agent Domain-Specific
Collaborative Multi-Agent Systems
Systems Applications ProcessBench OmniDocBench
benchmark Benchmark

Main Contributions Organization of

Comparison with Our Survey
of the Paper the Paper
Agent-as-a-Judge ...

AI Agents Challenges and and Open

Conclusion
Problems

What are the key AI agent frameworks and What are the key challenges and open
What are the key conclusions and future
applications developed between 2024 and problems in advancing AI agents and
directions for large language models
2025 for achieving autonomous decision- large language models?
(LLMs) and autonomous AI agents?
making and dynamic reasoning in real-
world tasks?
AI Agents Reasoning

Key conclusions
AI Agent frameworks Why Do Multi-Agent LLM Systems
Fail?

AI Agents in Automated Scientific

AI Agent applications
Discovery
Challenges

Dynamic Tool Integration for

Autonomous AI Agents
AI Agents protocols

Empowering LLM Agents with

Integrated Search via Reinforcement
Future directions
Learning
Training datasets

Vulnerabilities of AI Agents Protocols

Fig. 1: Survey Structure.

reasoning. The survey offers a comprehensive taxonomy of Moreover, the paper critically highlights existing deficiencies
Agentic RAG architectures, highlights key applications across in the field, notably the need for metrics that more effectively
various sectors, including healthcare, finance, and education, capture cost efficiency, safety, and robustness. In doing so,
and outlines practical implementation strategies. it maps the current landscape of agent evaluation and sets
forth compelling directions for future inquiry, underscoring the
Complementing this architectural perspective, Yehudai et importance of scalable and fine-grained evaluation techniques
al. [50] mark a significant milestone in artificial intelligence in the rapidly evolving AI domain.
by surveying evaluation methodologies for agents powered
by large language models (LLMs). It thoroughly reviews Similarly, Chen et al. [51] focus on Role-Playing Agents
the capabilities of these agents, focusing on core functions (RPAs), a growing class of LLM-based agents that mimic
such as planning, tool utilization, self-reflection, and mem- human behavior across various tasks. Recognizing the inherent
ory, while assessing specialized applications ranging from challenges in evaluating such diverse systems, the authors sys-
web interactions to software engineering and conversational tematically reviewed 1,676 papers published between January
tasks. The authors uncover a clear trend toward developing 2021 and December 2024. Their extensive analysis identifies
more rigorous, dynamically updated evaluation frameworks six key agent attributes, seven task attributes, and seven
by examining both targeted benchmarks for domain-specific evaluation metrics that are prevalent in the current literature.
applications and those designed for more generalist agents. Based on these insights, the paper proposes an evidence-based,
4

TABLE I: An overview of selected surveys on AI Agents.

Theme Reference Year Key Contribution Benchmark AI Agent AI Agent AI Challenges
Frame- Applica- Agents & Open
works tions Protocols Problems

LLM-based Agents Wang et 2024 Survey of LLM-based agent technologies

in Software Engi- al. [47] in SE; proposes a perception–memory–action
neering framework.
LLM-based Agents Jin et al. [48] 2024 Reviews LLM vs. autonomous-agent capabili-
in Software Engi- ties across six SE domains; highlights auton-
neering omy gaps.
Agent Architectures Singh et 2025 Introduces Agentic RAG: embedding au-
& Evaluation al. [49] tonomous agents in RAG with planning, reflec-
tion, tool use, and collaboration.
Agent Architectures Yehudai et 2025 Surveys evaluation methodologies and bench-
& Evaluation al. [50] marks for LLM agents, covering cost, safety,
and robustness.
Agent Architectures Chen et 2025 Analyzes 1,676 RPAs, identifies core attributes,
& Evaluation al. [51] and proposes standardized evaluation guide-
lines.
Multi-Agent Yan et 2025 Comprehensive survey of LLM-powered MAS;
Systems al. [52] focuses on communication, scalability, security,
and multimodality.
Multi-Agent Guo et 2024 Traces evolution from single-agent LLM rea-
Systems al. [38] soning to collaborative MAS; examines profil-
ing and communication.
Healthcare Wang et 2025 Reviews LLM-agent architectures for clinical
al. [28] decision support, documentation, training; dis-
cusses ethics.
Social Agents in Feng et 2024 Surveys LLM-based social agents in game the-
Game Theory al. [53] ory; categorizes frameworks, agent attributes,
and evaluation protocols.
GUI Agents Zhang et 2024 Chronicles evolution of LLM-driven GUI
al. [54] agents; covers multimodal understanding and
large-action models.
Personal LLM Li et al. [55] 2024 Examines personal LLM agents integrating
Agents user data/devices; surveys architectures and se-
curity challenges.
Scientific Discovery Gridach et 2025 Explores Agentic AI in automating research
al. [21] workflows across domains; highlights reliabil-
ity and ethics.
Chemistry Ramos et 2025 Reviews LLM roles in molecule design and
al. [56] synthesis planning; introduces agents for lab
control.
Our Survey Ferrag et al. 2025 Unified end-to-end survey covering bench-
marks, frameworks, applications, protocols, and
challenges.
Not Considered ( ); Partial discussion ( ); Considered ( );

actionable, and generalizable evaluation guideline designed to enabled significant advances in complex problem-solving and
standardize the assessment of RPAs. world simulation. Key aspects of these systems are examined,
including the domains and environments they simulate, the
C. Multi-Agent Systems profiling and communication strategies employed by individ-
Yan et al. [52] provides a comprehensive survey on integrat- ual agents, and the mechanisms that underpin the enhancement
ing LLMs into multi-agent systems (MAS). Their work em- of their collective capacities.
phasizes the communication-centric aspects that enable agents
to engage in both cooperative and competitive interactions,
D. Domain-Specific Applications
thereby tackling tasks that are unmanageable for individual
agents. The paper examines system-level features, internal 1) Healthcare: Wang et al. [28] explores the transforma-
communication mechanisms, and challenges, including scala- tive impact of LLM-based agents on healthcare, presenting
bility, security, and multimodal integration. In a related study, a detailed review of their architectures, applications, and
Guo et al. [38] offer an extensive overview of LLM-based inherent challenges. It dissects the core components of medical
multi-agent systems, charting the evolution from single-agent agent systems, such as system profiles, clinical planning
decision-making to collaborative frameworks that enhance mechanisms, and medical reasoning frameworks, while also
collective problem-solving and world simulation. In a related discussing methods to enhance external capacities. Major
study, Guo et al. [38] provide an extensive overview of large application areas include clinical decision support, medical
language model (LLM)-based multi-agent systems, building documentation, training simulations, and overall healthcare
on the success of LLMs in autonomous planning and reason- service optimization. The survey further evaluates the per-
ing. The authors detail how the evolution from single-agent formance of these agents using established frameworks and
decision-making to collaborative multi-agent frameworks has metrics, identifying persistent challenges such as hallucination
5

management, multimodal integration, and ethical considera- used in the field, offering valuable insights into current prac-
tions. tices. Moreover, the paper critically addresses significant chal-
2) Social Agents in Game-Theoretic Scenarios: Feng et al. lenges, including automating comprehensive literature reviews,
[53] provide a review of research on LLM-based social agents ensuring system reliability, and addressing ethical concerns.
in game-theoretic scenarios. This area has gained prominence It outlines future research directions, emphasizing the im-
for assessing social intelligence in AI systems. The authors portance of human-AI collaboration and improved system
categorize the literature into three main components. First, the calibration.
game framework is examined, highlighting various choice- and 6) Chemistry: Ramos et al. [56] examine the transforma-
communication-focused scenarios. Second, the paper explores tive impact of large language models (LLMs) in chemistry,
the attributes of social agents, examining their preferences, focusing on their roles in molecule design, property prediction,
beliefs, and reasoning capabilities. Third, it discusses evalua- and synthesis optimization. It highlights how LLMs not only
tion protocols incorporating game-agnostic and game-specific accelerate scientific discovery through automation but also
metrics to assess performance. By synthesizing current studies discuss the advent of LLM-based autonomous agents. These
and outlining future research directions, the survey offers agents extend the functionality of LLMs by interfacing with
valuable insights to further the development and systematic their environment and performing tasks such as literature
evaluation of social agents within game-theoretic contexts. scraping, automated laboratory control, and synthesis plan-
3) GUI Agents: Zhang et al. [54] review LLM-brained ning. Expanding the discussion beyond chemistry, the review
GUI agents, marking a paradigm shift in human-computer also considers applications across other scientific domains.
interaction through integrating multimodal LLMs. It traces
the historical evolution of GUI automation, detailing how E. Comparison with Our Survey
advancements in natural language understanding, code gen-
eration, and visual processing have enabled these agents to Table I presents a consolidated view of how existing works
interpret complex graphical user interface (GUI) elements cover key themes, benchmarks, AI agent frameworks, AI agent
and execute multi-step tasks from conversational commands. applications, AI agents protocols, and challenges & open prob-
The survey systematically examines the core components of lems against our survey. While prior studies typically focus
these systems, including existing frameworks, data collection on one or two aspects (e.g., Yehudai et al. [50] on evaluation
and utilization methods for training, and the development of benchmarks, Singh et al. [49] on RAG architectures, Yan et
specialized large-scale action models for GUI tasks. al. [52] on multi-agent communication, or Wang et al. [28] on
4) Personal LLM Agents: Li et al. [55] explore the evolu- domain-specific applications), none integrate the full spectrum
tion of intelligent personal assistants (IPAs) by focusing on of developments in a single, unified treatment. In contrast,
Personal LLM Agents LLM-based agents that deeply inte- our survey is the first to systematically combine state-of-
grate personal data and devices to provide enhanced personal the-art benchmarks, framework design, application domains,
assistance. The authors outline the limitations of traditional communication protocols, and a forward-looking discussion of
IPAs, including insufficient understanding of user intent, task challenges and open problems, thereby providing researchers
planning, and tool utilization, which have hindered their practi- with a comprehensive roadmap for advancing LLM-based
cality and scalability. In contrast, the emergence of foundation autonomous AI agents.
models like LLMs offer new possibilities by leveraging ad-
vanced semantic understanding and reasoning for autonomous III. LLM AND AGENTIC AI B ENCHMARKS
problem-solving. The survey systematically reviews the archi- This section provides a comprehensive overview of bench-
tecture and design choices underlying Personal LLM Agents, marks developed between 2019 and 2025 that rigorously eval-
informed by expert opinions, and examines key challenges uate large language models (LLMs) across diverse and chal-
related to intelligence, efficiency, and security. Furthermore, it lenging domains. For instance, ENIGMAEVAL [57] assesses
comprehensively analyzes representative solutions addressing complex multimodal puzzle-solving by requiring the synthesis
these challenges, laying the groundwork for Personal LLM of textual and visual clues, while ComplexFuncBench [59]
Agents to become a major paradigm in next-generation end- challenges models with multi-step function-calling tasks that
user software. mirror real-world scenarios. Humanity’s Last Exam (HLE)
5) Scientific Discovery: Gridach et al. [21] explore the [60] further raises the bar by presenting expert-level aca-
transformative role of Agentic AI in scientific discovery, demic questions across a broad spectrum of subjects, thereby
underscoring its potential to automate and enhance research reflecting the growing demand for deeper reasoning and
processes. It reviews how these systems, endowed with reason- domain-specific proficiency. Additional frameworks such as
ing, planning, and autonomous decision-making capabilities, FACTS Grounding [61] and ProcessBench [62] scrutinize
are revolutionizing traditional research activities, including the models’ capacities for generating factually accurate long-
literature reviews, hypothesis generation, experimental design, form responses and detecting errors in multi-step reasoning.
and data analysis. The paper highlights recent advancements Meanwhile, innovative evaluation paradigms like Agent-as-a-
across multiple scientific domains, such as chemistry, biology, Judge [64], JudgeBench [65], and CyberMetric [75] provide
and materials science, by categorizing existing Agentic AI granular insights into cybersecurity competencies and error-
systems and tools. It provides a detailed discussion on key detection capabilities. Tables III, II present a comprehensive
evaluation metrics, implementation frameworks, and datasets overview of benchmarks developed between 2024 and 2025.
6

TABLE II: Summary of LLM Benchmarks (Part 1)

Benchmark / Year Evaluation Focus Key Features / Metrics Innovations/Techniques Observations
Dataset

ENIGMAEVAL 2025 Multimodal Contains 1,184 puzzles combining Evaluates multimodal and Pushes models into unstructured,
[57] Reasoning text and images; state-of-the-art long-context reasoning using creative problem-solving scenarios
systems score only ∼7% on standard challenging puzzles from global requiring integration of visual and
puzzles and fail on the hardest ones. competitions. semantic clues.

MMLU 2021 Multitask Comprises 57 diverse tasks (from Assesses broad world knowledge and Designed for general multitask
Benchmark Knowledge elementary math to professional law) problem-solving skills; uncovers language understanding without
[58] testing zero-shot and few-shot calibration challenges and imbalances task-specific fine-tuning.
performance. between procedural and declarative
knowledge.

ComplexFuncBench 2025 Function Calling Evaluates complex function calling Introduces an automatic evaluation Highlights performance differences
[59] tasks with multi-step operations and framework (ComplexEval) for between closed models (e.g., Claude
input lengths up to 128k tokens over function calling, testing reasoning 3.5, GPT-4) and open models (e.g.,
more than 1,000 scenarios. over implicit parameters and Qwen 2.5, Llama 3.1).
constraints.

Humanity’s 2025 Academic Features 3,000 questions spanning Developed through a global Exposes significant performance gaps
Last Exam Reasoning over 100 subjects, including collaborative effort with nearly 1,000 as state-of-the-art LLMs score below
(HLE) [60] multi-modal challenges. experts; includes both multiple-choice 10%, serving as a critical tool for
and short-answer formats with assessing academic reasoning.
verifiable answers.

FACTS 2023 Factual Grounding Contains 1,719 examples requiring Uses a two-phase evaluation Focuses on factual accuracy and
Grounding detailed responses grounded in source (eligibility and factual grounding) information synthesis while excluding
[61] documents, with inputs reaching up with assessments from frontier LLM creative or complex reasoning tasks.
to 32,000 tokens. judges.

ProcessBench 2024 Error Detection Comprises 3,400 math problem cases Evaluates models’ ability to detect Targets granular error detection in
[62] with step-by-step solutions and the earliest error in reasoning; mathematical problem solving.
human-annotated error locations. compares process reward models with
LLM-based critics.

OmniDocBench 2024 Document A multi-source dataset spanning nine Provides a detailed, multi-level Addresses challenges such as fuzzy
[63] Understanding document types with 19 layout evaluation framework for document scans, watermarks, and complex
categories and 14 attribute labels. content extraction, contrasting layouts in document processing.
modular pipelines with end-to-end
methods.

Agent-as-a- 2024 Evaluation Evaluated on 55 code generation Leverages agentic systems to provide Reduces evaluation cost and time for
Judge [64] Methodology tasks with 365 hierarchical user granular, intermediate feedback; agentic systems, particularly in code
requirements. achieves up to 90% alignment with generation tasks.
human judgments.

JudgeBench 2024 Judgment Consists of 350 challenging response Transforms existing datasets into Aims to objectively assess
[65] Evaluation pairs across knowledge, reasoning, paired comparisons with objective LLM-based judges; fine-tuning can
math, and coding domains. correctness, mitigating positional bias boost judge accuracy significantly.
through double evaluation.

SimpleQA 2023 Factual QA Contains 4,326 fact-seeking questions Focuses on evaluating factual Highlights current limitations in
[66] across domains; uses a strict accuracy and reveals models’ handling straightforward, factual
three-tier grading system. overconfidence in incorrect responses queries.
through repeated testing.

FineTasks [67] 2023 Multilingual Task Evaluates 185 candidate tasks across Employs metrics such as Provides a scalable, multilingual
Selection nine languages, ultimately selecting monotonicity, low noise, non-random evaluation platform that highlights the
96 reliable tasks; supports over 550 performance, and model ordering impact of task formulation.
tasks overall. consistency to assess task quality.

FRAMES [68] 2024 Retrieval & Consists of 824 multi-hop questions Unifies evaluations of factual Baseline experiments show
Reasoning requiring integration of 2–15 accuracy, retrieval, and reasoning; improvements from 40% (without
Wikipedia articles. labels questions with specific retrieval) to 66% (with multi-step
reasoning types (e.g., numerical, retrieval).
tabular).

DABStep [69] 2025 Step-Based A step-based approach for multi-step Decomposes complex problem Highlights the significant challenges
Reasoning reasoning tasks; the best model solving into discrete steps with in training models for complex,
achieves only a 16% success rate. iterative refinement and iterative reasoning.
self-correction.
7

TABLE III: Summary of LLM Benchmarks (Part 2)

Benchmark / Year Evaluation Focus Key Features / Metrics Innovations/Techniques Observations
Dataset

BFCL v2 [70] 2025 Function Calling Contains 2,251 Leverages real-world, Demonstrates that models such as
question-function-answer pairs user-contributed data to address Claude 3.5 and GPT-4 outperform
covering simple to parallel function issues like data contamination and others, while some open models
calls. bias in function calling evaluation. struggle.

SWE-Lancer 2025 Software Consists of over 1,400 freelance Uses triple-verified tests for Indicates that even advanced models
[71] Engineering software engineering tasks, including independent tasks and benchmarks (e.g., Claude 3.5 Sonnet) have low
independent and managerial tasks managerial decisions against hiring pass rates (26.2%) on implementation
with real-world payout data. manager selections. tasks.

CRAG 2024 Retrieval- Comprises 4,409 question-answer Evaluates the generative component Highlights performance drops for
Benchmark Augmented pairs across 5 domains; simulates of RAG pipelines; shows questions involving highly dynamic
[72] Generation retrieval with mock APIs. improvement from 34% to 63% or less popular facts.
accuracy with advanced RAG
methods.

OCCULT 2025 Cybersecurity A lightweight framework for Simulates real-world threat scenarios Preliminary results indicate models
Benchmark operational evaluation of to assess LLM capabilities in like DeepSeek-R1 achieve over 90%
[73] cybersecurity risks; includes three offensive cyber operations. in Threat Actor Competency Tests.
distinct OCO benchmarks.

DIA 2024 Dynamic Problem Uses dynamic question templates Introduces innovative metrics for Reveals gaps in handling complex
Benchmark Solving with mutable parameters across reliability and confidence over tasks and compares models’
[74] domains (math, cryptography, multiple attempts; emphasizes self-assessment abilities.
cybersecurity, computer science). adaptive intelligence.

CyberMetric 2024 Cybersecurity A suite of multiple-choice Q&A Generated using GPT-3.5 and RAG, it Demonstrates that larger,
Benchmark Knowledge datasets (CyberMetric-80, -500, benchmarks cybersecurity knowledge domain-specific models outperform
[75] -2000, -10000) validated over 200 against human performance. smaller ones in cybersecurity
human expert hours. understanding.

BIG-Bench 2025 Challenging An elevated-difficulty variant of Replaces each BBH task with a more Emphasizes substantial room for
Extra Hard Reasoning BIG-Bench Hard; average accuracy is challenging variant to probe improvement in general-purpose
[76] 9.8% for general models and 44.8% reasoning capabilities robustly. reasoning skills.
for reasoning-specialized models.

MultiAgentBench 2025 Multi-Agent Encompasses six domains: research Investigates various coordination GPT-4o-mini achieves the highest
[77] proposal writing, Minecraft structure protocols (star, chain, tree, graph); average task score; highlights synergy
building, database error analysis, peer-to-peer communication plus vs. complexity trade-offs in
collaborative coding, competitive cognitive planning yields a 3% multi-agent LLM settings.
Werewolf gameplay, and resource improvement in milestone
bargaining. achievement. Graph-based protocols
outperform others in research tasks.

GAIA [78] 2024 General AI 466 curated questions with reference Emphasizes everyday reasoning tasks Highlights the large performance gap
Assistants answers; humans achieve 92% involving multi-modality, web between humans and SOTA models;
accuracy while GPT-4 with plugins browsing, and tool use. Targets AI aims to measure truly
only reaches 15%. robustness over specialized skills. general-purpose AI capabilities.
CASTLE [79] 2025 Vulnerability 250 hand-crafted micro-benchmark Integrates evaluations across 13 static Formal verification tools (e.g.,
detection in source programs covering 25 common analysis tools, 10 LLMs, and two ESBMC) minimize false positives but
code CWEs; introduces the novel CASTLE formal verification tools; provides a miss vulnerabilities beyond model
Score metric unified framework for comparing checking; static analyzers generate
diverse methods excessive false positives; LLMs
perform well on small code snippets,
but accuracy declines and
hallucinations increase as code size
grows
SPIN-Bench 2025 Strategic Planning, Evaluates reasoning and strategic Systematically varies action spaces, Reveals that while LLMs perform
[80] Interaction, and behavior in diverse social settings by state complexity, and the number of basic fact retrieval and short-range
Negotiation combining classical PDDL tasks, interacting agents to simulate realistic planning reasonably well, they
competitive board games, cooperative social interactions, providing both a struggle with deep multi-hop
card games, and multi-agent benchmark and an arena for reasoning and socially adept
negotiation scenarios. multi-agent evaluation. coordination, highlighting a
significant gap in robust multi-agent
planning and human–AI teaming.

τ -bench [81] 2024 Conversational Evaluates dynamic, multi-turn Integrates domain-specific API tool Reveals that even state-of-the-art
Agent Evaluation conversations by comparing the final usage and strict policy adherence agents (e.g., GPT-4o) succeed on less
database state with an annotated goal within simulated user interactions to than 50% of tasks, with marked
state using a novel passk metric. assess agent reliability over multiple inconsistency (e.g., pass8 < 25% in
trials. retail), highlighting the need for
improved consistency and
rule-following.
8

A. ENIGMAEVAL benchmark 90% accuracy, HLE presents a significantly more demanding

ENIGMAEVAL [57] is a benchmark designed to rigorously test, featuring 3,000 questions spanning over 100 subjects
evaluate advanced language models’ multimodal and long- including mathematics, humanities, and the natural sciences.
context reasoning capabilities using challenging puzzles de- This benchmark is the product of a global collaborative
rived from global competitions. The dataset comprises 1,184 effort, with nearly 1,000 subject matter experts from over 500
complex puzzles that combine text and images, requiring mod- institutions contributing questions that are both multi-modal
els to synthesize disparate clues, perform multi-step deductive and resistant to quick internet retrieval, ensuring that only
reasoning, and integrate visual and semantic information to genuine deep academic understanding can lead to success. The
arrive at unambiguous, verifiable solutions. Unlike conven- tasks, which include both multiple-choice and short-answer
tional benchmarks focusing on well-structured academic tasks, formats with clearly defined, verifiable answers, expose a sub-
ENIGMAEVAL pushes models into unstructured, creative stantial performance gap: current state-of-the-art LLMs, such
problem-solving scenarios where even state-of-the-art systems as DeepSeek R1, OpenAI’s models, Google DeepMind Gemini
achieve only about 7% accuracy on standard puzzles and fail Thinking, and Anthropic Sonnet 3.5, perform at less than 10%
on the hardest ones. accuracy and suffer from high calibration errors, indicating
overconfidence in incorrect responses. The results underscore
that while existing benchmarks may no longer provide a
B. MMLU Benchmark
meaningful measure of progress, HLE serves as a critical
Measuring Massive Multitask Language Understanding tool for assessing the true academic reasoning capabilities of
(MMLU) [58] is a comprehensive benchmark designed by LLMs, potentially heralding a new era in benchmark design
Hendrycks et al. (2021) to evaluate large language models as the field moves toward more challenging and nuanced
across a diverse range of subjects, from elementary mathe- evaluations in the pursuit of artificial general intelligence.
matics to professional law. The benchmark comprises 57 tasks
that test models’ ability to apply broad world knowledge and
problem-solving skills in zero-shot and few-shot settings, em- E. FACTS Grounding benchmark
phasizing generalization without task-specific fine-tuning. The Google DeepMind introduced FACTS Grounding [61], a
study also uncovers challenges related to model calibration and comprehensive benchmark designed to evaluate how accu-
the imbalance between procedural and declarative knowledge, rately LLMs ground their long-form responses in provided
highlighting critical areas where current models fall short of source documents while avoiding hallucinations. The bench-
expert-level proficiency. mark comprises 1,719 meticulously crafted examples split
into 860 public and 859 private cases that require models to
C. ComplexFuncBench Benchmark generate detailed answers strictly based on a corresponding
context document, with inputs reaching up to 32,000 tokens.
Zhong et al. [59] introduced ComplexFuncBench, a novel
Covering diverse domains such as medicine, law, technology,
benchmark designed to evaluate large language models
finance, and retail, FACTS Grounding excludes tasks that re-
(LLMs) on complex function calling tasks in real-world set-
quire creativity, mathematics, or complex reasoning, focusing
tings. Unlike previous benchmarks, ComplexFuncBench chal-
squarely on factual accuracy and information synthesis. To
lenges models with multi-step operations within a single turn,
ensure robust and unbiased evaluation, responses are assessed
adherence to user-imposed constraints, reasoning over implicit
in two phases: eligibility and factual grounding using a panel
parameter values, and managing extensive input lengths that
of three frontier LLM judges (Gemini 1.5 Pro, GPT-4o,
can exceed 500 tokens, including scenarios with a context win-
and Claude 3.5 Sonnet), with final scores derived from the
dow of up to 128k tokens. Complementing the benchmark, the
aggregation of these assessments. With an online leaderboard
authors present an automatic evaluation framework, Complex-
hosted on Kaggle already populated with initial results where,
Eval, which quantitatively assesses performance across over
for instance, Gemini 2.0 Flash leads with 83.6% accuracy
1,000 scenarios derived from five distinct aspects of function
FACTS Grounding aims to drive industry-wide advancements
calling. Experimental results reveal significant limitations in
in grounding and factuality, ultimately fostering greater trust
current state-of-the-art LLMs, with closed models like Claude
and reliability in LLM applications.
3.5 and OpenAI’s GPT-4 outperforming open models such as
Qwen 2.5 and Llama 3.1. Notably, the study identifies preva-
lent issues, including value errors and premature termination F. ProcessBench benchmark
in multi-step function calls, underscoring the need for further
Qwen team [62] introduced ProcessBench, a novel bench-
research to enhance the function-calling capabilities of LLMs
mark specifically designed to evaluate the ability of language
in practical applications.
models to detect errors within the reasoning process for
mathematical problem solving. ProcessBench comprises 3,400
D. Humanity’s Last Exam (HLE) Benchmark test cases, primarily drawn from competition- and Olympiad-
Phan et al. [60] introduced Humanity’s Last Exam (HLE), a level math problems, where each case includes a detailed,
benchmark designed to push the limits of LLMs by challeng- step-by-step solution with human-annotated error locations.
ing them with expert-level academic tasks. Unlike traditional Models are tasked with identifying the earliest erroneous step
benchmarks such as MMLU, where LLMs have achieved over or confirming that all steps are correct, thereby providing a
9

granular assessment of their reasoning accuracy. The bench- Additionally, this method offers substantial cost and time
mark is employed to evaluate two classes of models: process savings, reducing evaluation costs to approximately 2.29%
reward models (PRMs) and critic models, the latter involving ($30.58 vs. $1,297.50) and cutting evaluation time down to
general large language models (LLMs) that are prompted to 118.43 minutes compared to 86.5 hours for human assess-
critique each solution step. Experimental results reveal two ments.
key findings. First, existing PRMs generally fail to generalize
to more challenging math problems beyond standard datasets I. JudgeBench Benchmark
like GSM8K and MATH, often underperforming relative to Tan et al. [65] proposed JudgeBench, a novel benchmark
both prompted LLM-based critics and a PRM fine-tuned on designed to objectively evaluate LLM-based judges models
a larger, more complex PRM800K dataset. Second, the best that are increasingly employed to assess and improve the
open-source model tested, QwQ-32B-Preview, demonstrates outputs of large language models by focusing on their ability
error detection capabilities that rival those of the proprietary to accurately discern factual and logical correctness rather than
GPT-4o, although it still falls short compared to reasoning- merely aligning with human stylistic preferences. Unlike prior
specialized models like o1-mini. benchmarks that rely primarily on crowdsourced human evalu-
ations, JudgeBench leverages a carefully constructed set of 350
G. OmniDocBench Benchmark challenging response pairs spanning knowledge, reasoning,
Ouyang et al. [63] introduced OmniDocBench, a compre- math, and coding domains. The benchmark employs a novel
hensive multi-source benchmark designed to advance auto- pipeline to transform challenging existing datasets into paired
mated document content extraction a critical component for comparisons with preference labels based on objective correct-
high-quality data needs in LLMs and RAG systems. Om- ness while mitigating positional bias through double evaluation
niDocBench features a meticulously curated and annotated with swapped order. Comprehensive testing across various
dataset spanning nine diverse document types including aca- judge architectures, including prompted, fine-tuned, multi-
demic papers, textbooks, slides, notes, and financial documents agent judges, and reward models, reveals that even strong
and utilizes a detailed evaluation framework with 19 layout models, such as GPT-4o, often perform only marginally better
categories and 14 attribute labels to facilitate multi-level as- than random guessing, particularly on tasks requiring rigorous
sessments. Through extensive comparative analysis of existing error detection in intermediate reasoning steps. Moreover, fine-
modular pipelines and multimodal end-to-end methods, the tuning can significantly boost performance, as evidenced by
benchmark reveals that while specialized models (e.g., Nougat) a 14% improvement observed in Llama 3.1 8B, and reward
outperform general vision-language models (VLMs) on stan- models achieve accuracies in the 59–64% range.
dard documents, general VLMs exhibit superior resilience and
adaptability in challenging scenarios, such as those involving J. SimpleQA Benchmark
fuzzy scans, watermarks, or colorful backgrounds. Moreover, SimpleQA [66] is a benchmark introduced by OpenAI to
fine-tuning general VLMs with domain-specific data leads to assess and improve the factual accuracy of large language
enhanced performance, as evidenced by high accuracy scores models on short, fact-seeking questions. Comprising 4,326
in tasks like formula recognition (with models such as GPT-4o, questions spanning domains such as science/tech, politics,
Mathpix, and UniMERNet achieving around 85–86.8% accu- art, and geography, SimpleQA challenges models to deliver a
racy) and table recognition (RapidTable at 82.5%). Nonethe- single correct answer under a strict three-tier grading system
less, the findings also highlight persistent challenges, notably (”correct,” ”incorrect,” or ”not attempted”). While built on
that complex column layouts continue to degrade reading order foundational datasets such as TriviaQA and Natural Questions,
accuracy across all evaluated models. SimpleQA presents a more challenging task for LLMs. Early
results indicate that even advanced models, such as OpenAI
H. Agent-as-a-Judge o1-preview, achieve only 42.7% accuracy (with Claude 3.5
Meta team proposed the Agent-as-a-Judge framework [64], Sonnet trailing at 28.9%), and models tend to exhibit over-
an innovative evaluation approach explicitly designed for confidence in their incorrect responses. Moreover, experiments
agentic systems that overcome the limitations of traditional that repeated the same question 100 times revealed a strong
methods, which either focus solely on outcomes or require correlation between higher answer frequency and overall ac-
extensive manual labor. This framework provides granular, curacy. This benchmark thus provides critical insights into the
intermediate feedback throughout the task-solving process by current limitations of LLMs in handling straightforward, fac-
leveraging agentic systems to evaluate other agentic systems. tual queries. It underscores the need for further improvements
The authors demonstrate its effectiveness on code generation in grounding model outputs in reliable, factual data.
tasks using DevAI, a new benchmark comprising 55 real-
istic automated AI development tasks annotated with 365 K. FineTasks
hierarchical user requirements. Their evaluation shows that FineTasks [67] is a data-driven evaluation framework de-
Agent-as-a-Judge not only dramatically outperforms the con- signed to systematically select reliable tasks for assessing
ventional LLM-as-a-Judge approach (which typically achieves LLMs across diverse languages. Developed as the first step
a 60–70% alignment rate with human assessment) but also toward the broader FineWeb Multilingual initiative, Fine-
reaches an impressive 90% alignment with human judgments. Tasks evaluates candidate tasks based on four critical metrics:
10

TABLE IV: LLM Benchmark Comparison: Multimodal, Task Diversity, Reasoning & Agentic AI Evaluation

Benchmark Year Multimodal Task Diversity Reasoning Agentic AI

DROP [82] 2019 No English discrete reasoning comprehension High High No
MMLU [58] 2020 No Academic/general knowledge High Moderate No
MATH [83] 2021 No Evaluating mathematical reasoning High High No
Codex [84] 2021 No Evaluating LLMs trained on code Medium Medium No
MGSM [85] 2022 No Multilingual grade-school math problems High High No
FACTS Grounding [61] 2023 No Factual grounding in long responses High Low No
SimpleQA [66] 2023 No Factual Q&A High Low No
PersonaGym [86] 2024 No Dynamic evaluation framework for persona agents High High Yes
FineTasks [67] 2023 No Multilingual task selection High Medium No
GAIA [78] 2023 Yes General AI assistant tasks High High No
OmniDocBench [63] 2024 Yes Document content extraction High Medium No
ProcessBench [62] 2024 No Error detection in math solutions Low High No
MIRAI [87] 2024 No Evaluating llm agents for event forecasting High High Yes
AppWorld [88] 2025 No Benchmarking Interactive Coding Agents High High Yes
VisualAgentBench[89] 2024 Yes Benchmark for evaluating Large Multimodal Models High High Yes
ScienceAgentBench [90] 2024 No Evaluation of language agents for Scientific Discovery High High Yes
Agent-SafetyBench [91] 2024 No Safety evaluation of LLM agents High High Yes
DiscoveryBench [92] 2024 No Data-Driven Discovery High High Yes
BLADE [93] 2024 No Benchmark for data-driven scientific discovery High High Yes
Dyn-VQA [8] 2024 Yes Adaptive VQA multimodal benchmark High High Yes
Agent-as-a-Judge [64] 2024 No Code generation evaluation Low Low Yes
JudgeBench [65] 2024 No Evaluation of LLM-based judges High High No
FRAMES [68] 2024 No Factuality & retrieval for RAG High High No
MedChain [94] 2024 No Interactive clinical decision adaptation High High Yes
CRAG [72] 2024 No Factual Q&A for RAG systems High High No
DIA [74] 2024 Yes Dynamic problem solving High High No
CyberMetric [75] 2024 No Cybersecurity Q&A Low Low No
TeamCraft [95] 2024 Yes Collaborative Minecraft multimodal evaluation High High Yes
AgentHarm [96] 2024 No LLM jailbreak robustness evaluation High High Yes
τ -bench [81] 2024 No Conversational Agent Evaluation High High Yes
LegalAgentBench [97] 2024 No Evaluating LLM Agents in Legal Domain High High Yes
GPQA [98] 2024 No Biology, physics, and chemistry High High No
ENIGMAEVAL [57] 2025 Yes Complex multimodal puzzles Low High No
ComplexFuncBench [59] 2025 No Function calling tasks Medium High No
MedAgentsBench [99] 2025 No Complex medical reasoning & treatment planning High High Yes
Humanity’s Last Exam [60] 2025 Yes Expert-level academic tasks High High No
DABStep [69] 2025 No Step-based multi-step reasoning Low High No
BFCL v2 [70] 2025 No Function calling evaluation High High No
SWE-Lancer [71] 2025 No Freelance software engineering tasks High Moderate No
OCCULT [73] 2025 No Cyber security operational tasks Medium High No
BIG-Bench Extra Hard [76] 2025 No Challenging reasoning tasks High High No
MultiAgentBench [77] 2025 Yes Multi-agent coordination tasks High High Yes
CASTLE [79] 2025 No Software vulnerability detection Low Medium No
EmbodiedEval [100] 2025 Yes 3D embodied tasks benchmark Medium High Yes
SPIN-Bench [80] 2025 Yes Strategic planning & social reasoning High High Yes
OlympicArena [101] 2025 Yes Olympic competition problems Medium High No
SciReplicate-Bench [102] 2025 No Algorithm-driven code generation High High Yes
EconAgentBench [103] 2025 No Decision-making tasks in economic environments High High Yes
VeriLA [104] 2025 No Human-centered LLM failure verification High High Yes
CapaBench [105] 2025 No Evaluation of modular contributions in LLM agents High High Yes
AgentOrca [106] 2025 No Dual-system agent compliance evaluation High High Yes
ProjectEval [107] 2025 No Project-level code generation evaluation Medium High Yes
RefactorBench [108] 2025 No Autonomous multi-file refactoring evaluation High High Yes
BEARCUBS [109] 2025 Yes Multimodal web agents evaluation High Medium Yes
Robotouille [110] 2025 No Asynchronous Planning Benchmark High High Yes
DSGBench [111] 2025 No Strategic games decision evaluation Medium High Yes
TheoremExplainBench [112] 2025 Yes STEM theorem animation videos Medium High Yes
RefuteBench 2.0 [113] 2025 No Multi-turn LLM feedback evaluation High High Yes
MLGym [114] 2025 Yes ML agents automate research High High Yes
DataSciBench [115] 2025 No LLM Data Science Benchmark High High Yes
EmbodiedBench [116] 2025 Yes Vision-driven embodied agent evaluation High High Yes
BrowseComp [117] 2025 No Benchmark for Browsing Agents High High Yes
Vending-Bench [118] 2025 No Long-horizon business simulation Medium High Yes
MLE-bench [119] 2025 No ML engineering-related competitions from Kaggle Medium High Yes
SWE-PolyBench [120] 2025 No Evaluation of coding agents High High Yes
Multi-SWE-bench [121] 2025 No Multilingual Benchmark for Issue Resolving High High No
11

monotonicity, low noise, non-random performance, and model improvements, experimental results reveal that even the best-
ordering consistency to ensure robustness and reliability. In performing model under this framework only achieves a 16%
an extensive study, the Hugging Face team tested 185 candi- success rate on the evaluated tasks. This modest accuracy un-
date tasks across nine languages (including Chinese, French, derscores the significant challenges that remain in effectively
Arabic, Russian, Thai, Hindi, Turkish, Swahili, and Telugu), training models for complex, iterative reasoning and highlights
ultimately selecting 96 final tasks that cover domains such the need for further research and optimization.
as reading comprehension, general knowledge, language un-
derstanding, and reasoning. The work further reveals that the N. BFCL v2 benchmark
formulation of tasks has a significant impact on performance;
for instance, Cloze format tasks are more effective during Mao et al. [70] propose BFCL v2, a novel benchmark and
early training phases, while multiple-choice formats yield leaderboard designed to evaluate large language models’ func-
better evaluation results. Recommended evaluation metrics tion calling abilities using real-world, user-contributed data.
include length normalization for most tasks and pointwise The benchmark comprises 2,251 question-function-answer
mutual information (PMI) for complex reasoning challenges. pairs, enabling comprehensive assessments across a range of
Benchmarking 35 open and closed-source LLMs demonstrated scenarios from multiple and straightforward function calls to
that open models are narrowing the gap with their proprietary parallel executions and irrelevance detection. By leveraging
counterparts, with Qwen 2 models excelling in high- and mid- authentic user interactions, BFCL v2 addresses prevalent is-
resource languages and Gemma-2 particularly strong in low- sues such as data contamination, bias, and limited gener-
resource settings. Moreover, the FineTasks framework supports alization in previous evaluation methods. Initial evaluations
over 550 tasks across various languages, providing a scalable reveal that models like Claude 3.5 and GPT-4 consistently
and comprehensive platform for advancing multilingual large outperform others, with Mistral, Llama 3.1 FT, and Gemini
language model (LLM) evaluation. following in performance. However, some open models, such
as Hermes, struggle due to potential prompting and formatting
challenges. Overall, BFCL v2 offers a rigorous and diverse
L. FRAMES benchmark platform for benchmarking the practical capabilities of LLMs
Google team [68] propose FRAMES (Factuality, Retrieval, in interfacing with external tools and APIs, thereby providing
and Reasoning MEasurement Set), a comprehensive evaluation valuable insights for future advancements in function calling
dataset specifically designed to assess the capabilities of and interactive AI systems.
retrieval-augmented generation (RAG) systems built on LLMs.
FRAMES addresses a critical need by unifying evaluations of O. SWE-Lancer benchmark
factual accuracy, retrieval effectiveness, and reasoning abil- OpenAI team [71] presents SWE-Lancer, an innovative
ity in an end-to-end framework, rather than assessing these benchmark comprised of over 1,400 freelance software engi-
facets in isolation. The dataset comprises 824 challenging neering tasks collected from Upwork, representing more than
multi-hop questions spanning diverse topics, including history, $1 million in real-world payouts. This benchmark encom-
sports, science, and health, each requiring the integration of passes both independent engineering tasks, ranging from minor
information from between two and fifteen Wikipedia articles. bug fixes to substantial feature implementations valued up to
By labeling questions with specific reasoning types, such as $32,000, and managerial tasks, where models must select the
numerical or tabular. FRAMES provides a nuanced benchmark best technical proposals. Independent tasks are rigorously eval-
to identify the strengths and weaknesses of current RAG uated using end-to-end tests that have been triple-verified by
implementations. Baseline experiments reveal that state-of- experienced engineers. At the same time, managerial decisions
the-art models like Gemini-Pro-1.5-0514 achieve only 40% are benchmarked against the selections made by the original
accuracy when operating without retrieval mechanisms, but hiring managers. Experimental results indicate that state-of-
their performance increases significantly to 66% with a multi- the-art models, such as Claude 3.5 Sonnet, still struggle with
step retrieval pipeline, representing a greater than 50% im- the majority of these tasks, achieving a 26.2% pass rate on
provement. independent tasks and 44.9% on managerial tasks, which
translates to an estimated earning of $403K a figure well below
M. DABStep benchmark the total available value. Notably, the analysis highlights that
while models tend to perform better in evaluative managerial
DabStep [69] is a new framework from Hugging Face that roles than in direct code implementation, increasing inference-
pioneers a step-based approach to enhance the performance time computing can enhance performance.
and efficiency of language models on multi-step reasoning
tasks. DabStep addresses the challenges of traditional end-
to-end inference by decomposing complex problem-solving P. Comprehensive RAG Benchmark (CRAG)
into discrete, manageable steps, enabling models to refine Yang et al. [72] propose the Comprehensive RAG Bench-
their outputs through step-level feedback and iterative dynamic mark (CRAG), a novel dataset designed to evaluate the factual
adjustments. This method is designed to enable models to self- question-answering capabilities of Retrieval-Augmented Gen-
correct and navigate the complexities of multi-step reasoning eration systems rigorously. CRAG comprises 4,409 question-
processes more effectively. However, despite these innovative answer pairs across five domains and eight distinct question
12

categories. It incorporates mock APIs to simulate web and S. CyberMetric benchmark

Knowledge Graph retrieval, thereby reflecting the varied levels Tihanyi et al. [75] introduces a suite of novel
of entity popularity and temporal dynamism encountered in multiple-choice Q&A benchmark datasets CyberMetric-80,
real-world scenarios. Empirical results show that state-of-the- CyberMetric-500, CyberMetric-2000, and CyberMetric-10000
art large language models without grounding achieve only designed to evaluate the cybersecurity knowledge of LLMs
around 34% accuracy on CRAG, and that incorporating simple rigorously. By leveraging GPT-3.5 and Retrieval-Augmented
RAG methods improves this to just 44%, whereas industry- Generation (RAG), the authors generated questions from
leading RAG systems can reach 63% accuracy without hal- diverse cybersecurity sources such as NIST standards, research
lucination. The benchmark also highlights significant perfor- papers, publicly accessible books, and RFCs. Complete with
mance drops for questions involving highly dynamic, lower- four possible answers, each question underwent extensive
popularity, or more complex facts. Notably, CRAG focuses rounds of error checking and refinement, with over 200 hours
solely on evaluating the generative component of the RAG of human expert validation to ensure accuracy and domain
pipeline, and early findings indicate that Llama 3 70B nearly relevance. Evaluations were conducted on 25 state-of-the-art
matches GPT-4 Turbo across these tasks. large language models (LLMs), and the results were further
benchmarked against human performance on CyberMetric-80
Q. OCCULT Benchmark in a closed-book scenario. Findings reveal that models like
GPT-4o, GPT-4-turbo, Mixtral-8x7 B-Instruct, Falcon-180
Kouremetis et al. [73] present OCCULT, a novel and B-Chat, and GEMINI-pro 1.0 exhibit superior cybersecurity
lightweight operational evaluation framework that rigorously understanding, outperforming humans on CyberMetric-80,
measures the cybersecurity risks associated with using large while smaller models such as Llama-3-8B, Phi-2, and
language models (LLMs) for offensive cyber operations Gemma-7b lag behind, underscoring the value of model scale
(OCO). Traditionally, evaluating AI in cybersecurity has relied and domain-specific data in this challenging field.
on simplistic, all-or-nothing tests such as capture-the-flag exer-
cises, which fail to capture the nuanced threats faced by mod-
ern infrastructure. In contrast, OCCULT enables cybersecurity T. BIG-Bench Extra Hard
experts to craft repeatable and contextualized benchmarks by A team from Google DeepMind [76] addresses a critical
simulating real-world threat scenarios. The authors detail three gap in evaluating large language models by tackling the limi-
distinct OCO benchmarks designed to assess the capability tations of current reasoning benchmarks, which have primarily
of LLMs to execute adversarial tactics, providing preliminary focused on mathematical and coding tasks. While the BIG-
evaluation results that indicate a significant advancement in Bench dataset [122] and its more complex variant, BIG-Bench
AI-enabled cyber threats. Most notably, the DeepSeek-R1 Hard (BBH) [123], have provided comprehensive assessments
model correctly answered over 90% of questions in the Threat of general reasoning abilities, recent advances in LLMs have
Actor Competency Test for LLMs (TACTL). led to saturation, with state-of-the-art models achieving near-
perfect scores on many BBH tasks. To overcome this, the
authors introduce BIG-Bench Extra Hard (BBEH). This novel
R. DIA benchmark
benchmark replaces each BBH task with a more challenging
Dynamic Intelligence Assessment (DIA) [74] is introduced variant designed to probe similar reasoning capabilities at an
as a novel methodology to more rigorously test and compare elevated difficulty level. Evaluations on BBEH reveal that
the problem-solving abilities of AI models across diverse do- even the best general-purpose models only achieve an average
mains such as mathematics, cryptography, cybersecurity, and accuracy of 9.8%, while reasoning-specialized models reach
computer science. Unlike traditional benchmarks that rely on 44.8%, highlighting substantial room for improvement and
static question-answer pairs often allowing models to perform underscoring the ongoing challenge of developing LLMs with
uniformly well or rely on memorization DIA employs dynamic robust, versatile reasoning skills.
question templates with mutable parameters, presented in
various formats including text, PDFs, compiled binaries, visual
puzzles, and CTF-style challenges. This framework also intro- U. MultiAgentBench benchmark
duces four innovative metrics to evaluate a model’s reliability Zhu et al. [77] introduce MultiAgentBench, a benchmark
and confidence across multiple attempts, revealing that even specifically designed to evaluate the capabilities of multi-
simple questions are frequently answered incorrectly when agent systems powered by LLMs in dynamic, interactive
posed in different forms. Notably, the evaluation shows that environments. Unlike traditional benchmarks that focus on
while API models like GPT-4o may overestimate their math- single-agent performance or narrow domains, MultiAgent-
ematical capabilities, models such as ChatGPT-4o perform Bench encompasses six diverse domains, including research
better due to practical tool usage, and OpenAI’s o1-mini excels proposal writing, Minecraft structure building, database error
in self-assessment of task suitability. Testing 25 state-of-the- analysis, collaborative coding, competitive Werewolf game-
art LLMs with DIA-Bench reveals significant gaps in handling play, and resource bargaining to measure both task comple-
complex tasks and in adaptive intelligence, establishing a new tion and the quality of agent coordination using milestone-
standard for evaluating both problem-solving performance and based performance indicators. The study investigates various
a model’s ability to recognize its own limitations. coordination protocols, such as star, chain, tree, and graph
13

topologies, and finds that direct peer-to-peer communication performance bottlenecks in current large language models
and cognitive planning are particularly effective evidenced by (LLMs), which, while adept at factual retrieval and short-
a 3% improvement in milestone achievement when planning range planning, struggle with deep multi-hop reasoning, spa-
is employed while also noting that adding more agents can tial inference, and socially coordinated decision-making. For
decrease performance. Among the models evaluated (GPT- instance, models perform reasonably well on simple tasks like
4o-mini, 3.5, and Llama), GPT-4o-mini achieved the highest Tic-Tac-Toe but falter in complex environments such as Chess
average task score, and graph-based coordination protocols or Diplomacy, and even the best models achieve only around
outperformed other structures in research scenarios. 58.59% accuracy on classical planning tasks.

V. GAIA Benchmark Y. τ -bench

GAIA [78] is a groundbreaking benchmark designed to Yao et al. [81] present τ -bench, a benchmark designed
assess General AI Assistants on real-world questions that to evaluate language agents in realistic, dynamic, multi-turn
tap into fundamental abilities like reasoning, multi-modality conversational settings that emulate real-world environments.
handling, web browsing, and tool use. Unlike traditional In τ -bench, agents are challenged to interact with a simulated
benchmarks that focus on increasingly specialized tasks, GAIA user to understand needs, utilize domain-specific API tools
features conceptually simple questions solvable by humans (such as booking flights or returning items), and adhere to
at 92% accuracy that current systems, such as GPT-4 with provided policy guidelines, while performance is measured by
plugins, struggle with, achieving only 15%. Comprising 466 comparing the final database state with an annotated goal state.
meticulously curated questions with reference answers, GAIA A novel metric, passk , is introduced to assess reliability over
shifts the evaluation paradigm toward measuring AI robustness multiple trials. Experimental findings reveal that even state-
in everyday reasoning tasks, a critical step toward achieving of-the-art function-calling agents like GPT-4o succeed on less
true Artificial General Intelligence (AGI). This substantial than 50% of tasks, with significant inconsistency (for example,
performance gap between humans and state-of-the-art models pass8 scores below 25% in retail domains) and markedly lower
emphasizes the need for AI systems that can mimic the success rates for tasks requiring multiple database writes.
general-purpose, resilient reasoning exhibited by average hu- These results underscore the need for enhanced methods that
man problem solvers. improve consistency, adherence to rules, and overall reliability
in language agents for real-world applications.
W. CASTLE Benchmark
Z. Discussion and Comparison of LLM Benchmarks
Dubniczky et al. [79] introduce CASTLE, a novel bench-
marking framework for evaluating software vulnerability Table IV presents an extensive overview of benchmarks de-
detection methods, addressing existing approaches’ critical veloped from 2019 to 2025 for evaluating large language mod-
weaknesses. CASTLE assesses 13 static analysis tools, 10 els (LLMs) concerning multimodal capabilities, task scope,
LLMs, and two formal verification tools using a meticulously diversity, reasoning, and agentic behaviors. Early benchmarks,
curated dataset of 250 micro-benchmark programs that cover such as DROP [82], MMLU [58], MATH [83], Codex [84],
25 common CWEs. The framework proposes a new evaluation MGSM [85], FACTS Grounding [61], and SimpleQA [66],
metric, the CASTLE Score, to enable fair comparisons across concentrated on core competencies like discrete reasoning,
different methods. Results reveal that while formal verification academic knowledge, mathematical problem solving, and fac-
tools like ESBMC minimize false positives, they struggle with tual grounding. These pioneering efforts lay the groundwork
vulnerabilities beyond the scope of model checking. Static an- for performance evaluation in language understanding and
alyzers often generate excessive false positives, which burden reasoning tasks, setting a baseline against which later, more
developers with manual validation. LLMs perform strongly sophisticated benchmarks have been compared.
on small code snippets; however, their accuracy declines, and A notable progression in benchmark design is observed
hallucinations increase as code size grows. These findings with the emergence of frameworks that target more complex
suggest that, despite current limitations, LLMs hold significant agentic and multimodal tasks. For instance, PersonaGym [86]
promise for integration into code completion frameworks, and FineTasks [67] introduce dynamic persona evaluation and
providing real-time vulnerability prevention and marking an multilingual task selection. GAIA [78] expands the evaluative
important step toward more secure software systems. scope to general AI assistant tasks while OmniDocBench [63]
and ProcessBench [62] address document extraction and error
detection in mathematical solutions. Further, MIRAI [87],
X. SPIN-Bench Benchmark AppWorld [88], VisualAgentBench [89], and ScienceAgent-
Yao et al. [80] introduce a comprehensive evaluation frame- Bench [90] explore various facets of multimodal and scientific
work, SPIN-Bench, highlighting the challenges of strategic discovery tasks. This decade-spanning evolution is comple-
planning and social reasoning in AI agents. Unlike traditional mented by additional evaluations focusing on safety (Agent-
benchmarks focused on isolated tasks, SPIN-Bench combines SafetyBench [91]), discovery (DiscoveryBench [92]), code
classical planning, competitive board games, cooperative card generation (BLADE [93], Dyn-VQA [8], and Agent-as-a-
games, and negotiation scenarios to simulate real-world social Judge [64]), judicial reasoning (JudgeBench [65]), and clinical
interactions. This multifaceted approach reveals significant decision making (MedChain [94]), among others including
14

FineTasks EmbodiedEval
EmbodiedBench
[67] [116] [100] BEARCUBS
Multi-
SWE- ENIGMAEVAL [109]
bench [57]
[121] OlympicArena
[101]
TheoremExplainBench
[112]
DIA
[74]
VisualAgentBench
[89]
Dyn-VQA
[8]
Multimodal,
Task Visual &
TeamCraft JudgeBench Selection Embodied OmniDocBench
[65] BLADE Evaluations [63] EconAgentBench
[95]
AgentHarm [93] [103]
[96]
DiscoveryBench OCCULT
[92] GAIA [73]
τ -bench [78]
[81]
Agent-
CyberMetric
SafetyBench
[75]
[91]
MultiAgentBench
[77]
ScienceAgentBench MedAgentsBench
[90] [99]
SPIN-
Bench Agentic & Domain-
[80]
Interactive Specific
MIRAI LegalAgentBench
Evaluations [87] Evaluations [97]
VeriLA
[104]
BrowseComp
PersonaGym MedChain
[117]
[86] [94]
CapaBench
[105]
DataSciBench
[115]
AgentOrca
[106] LLM Benchmark
MLGym
Robotouille [114]
RefuteBench
[110] DSGBench 2.0
[111] [113]

DROP GPQA
[82] [98]
Academic
& Factual
General Grounding
MMLU
Knowledge & CRAG
[58] Retrieval [72]
Reasoning

BIG-Bench
FRAMES
Extra Hard
[68]
[76]
BFCL v2
Humanity’s [70]
SimpleQA
Last Exam
[66]
[60]

ComplexFuncBench FACTS
DABStep
[59] Grounding
[69]
[61]
Mathematical Code &
Problem Software MLE-
Solving Engineering bench
[119]
Codex
[84]
SWE-
PolyBench
Agent-as- [120]
a-Judge
[64]
CASTLE
[79]
MATH AppWorld
[83] [88] SWE-
SciReplicate- Lancer
MGSM [71]
Bench RefactorBench
[85] ProcessBench ProjectEval
[102] [108]
[62] [107]

Fig. 2: Classification of LLM Benchmarks for AI Agents Applications

FRAMES [68], CRAG [72], DIA [74], CyberMetric [75], TheoremExplainBench [112], RefuteBench 2.0 [113], ML-
TeamCraft [95], AgentHarm [96], τ -bench [81], LegalAgent- Gym [114], DataSciBench [115], EmbodiedBench [116],
Bench [97], and GPQA [98]. BrowseComp [117], and MLE-bench [119]. Collectively, these
benchmarks exemplify the field’s shift towards more compre-
Recent benchmarks from 2025 further indicate a sub-
hensive and nuanced evaluation metrics, supporting the de-
stantial expansion in the depth and breadth of large lan-
velopment of LLMs that can tackle increasingly multifaceted,
guage model (LLM) evaluations. ENIGMAEVAL [57] and
real-world challenges.
ComplexFuncBench [59] target complex puzzles and func-
tion calling tasks, while MedAgentsBench [99] and Hu- Fig. 2 groups benchmarks into categories such as Academic
manity’s Last Exam [60] focus on advanced medical rea- & General Knowledge Reasoning, Mathematical Problem
soning and expert-level academic tasks. Additional bench- Solving, Code & Software Engineering, Factual Grounding
marks such as DABStep [69], BFCL v2 [70], SWE- & Retrieval, Domain-Specific Evaluations, Multimodal/Visual
Lancer [71], and OCCULT [73] further diversify evalua- & Embodied Evaluations, Task Selection, and Agentic &
tive criteria by incorporating multi-step reasoning, cyberse- Interactive Evaluations, illustrating the full range of tasks used
curity, and freelance software engineering challenges. The to assess LLMs in AI agent settings.
table also includes BIG-Bench Extra Hard [76], MultiA-
gentBench [77], CASTLE [79], EmbodiedEval [100], SPIN- IV. AI AGENTS
Bench [80], OlympicArena [101], SciReplicate-Bench [102],
EconAgentBench [103], VeriLA [104], CapaBench [105], This section presents a comprehensive overview of AI agent
AgentOrca [106], ProjectEval [107], RefactorBench [108], frameworks and applications developed between 2024 and
BEARCUBS [109], Robotouille [110], DSGBench [111], 2025, highlighting transformative approaches that integrate
15

TABLE V: Overview of AI Agent Frameworks: Core Concepts, Workflow, and Advantages

Agent Framework Core Idea Workflow & Components Key Advantages

LangChain [124] Integrates LLMs with diverse tools to build Combines conversational LLMs, search Customizable roles and streamlined agent
autonomous agents. integrations, and utility functions into prototyping.
iterative workflows.

LlamaIndex [125] Enables autonomous agent creation via Wraps functions into FunctionTool Simplifies agent development with a
external tool integration. objects and employs a ReActAgent for dynamic, modular pipeline.
stepwise tool selection.

CrewAI [126] Orchestrates teams of specialized AI agents Structures systems into Crew (oversight), AI Mimics human team collaboration with
for complex tasks. Agents (specialized roles), Process flexible, parallel workflows.
(collaboration), and Tasks (assignments).

Swarm [127] Provides a lightweight, stateless abstraction Defines multiple agents with specific Fine-grained control and compatibility with
for multi-agent systems. instructions and roles; enables dynamic various backends.
handoffs and context management.

GUI Agent [128] Facilitates computer control via natural Translates user instructions and screenshots Demonstrates end-to-end performance in
language and visual inputs. into desktop actions (e.g., cursor movements, real-world desktop workflows.
clicks).

Agentic Reasoning [129] Enhances reasoning by integrating Leverages web-search, coding, and Mind Achieves improved multi-step
specialized external tool-using agents. Map agents to iteratively refine multi-step problem-solving and structured knowledge
reasoning. synthesis.

OctoTools [130] Empowers LLMs for complex reasoning via Combines standardized tool cards, a strategic Outperforms similar frameworks by up to
training-free tool integration. planner, and an executor for effective tool 10.6% on varied tasks.
usage.

Agents SDK [131] Provides a modular framework for building Offers core primitives such as Agents (LLMs Streamlines development with an extensible,
autonomous agent applications that integrate with instructions, tools, handoffs, and robust architecture that enhances
LLMs with external tools and advanced guardrails), Tools (wrapped functions/APIs), debuggability and scalability, enabling rapid
features. Context for state management, along with prototyping and seamless integration of
support for Streaming, Tracing, and complex, multi-agent workflows.
Guardrails to manage multi-turn interactions.

large language models with modular tools to achieve au-

tonomous decision-making and dynamic multi-step reason- Thinking (Reasoning) & Utility Functions &
Prompt (Instructions) Knowledge Store
ing. The frameworks discussed include LangChain [124],
LlamaIndex [125], CrewAI [126], and Swarm [127], which
abstract complex functionalities into reusable components that
Strategy Development Task (Assigned
enable context management, tool integration, and iterative (Planning) Objective)
AI Query Engines

refinement of outputs. Additionally, pioneering efforts in GUI

control [128] and agentic reasoning [129], [130] demonstrate
the increasing capabilities of these systems to interact with
Knowledge Store
external environments and tools in real-time. Self-Evaluation Designated Function

In parallel, this section presents a diverse range of AI

agent applications that span materials science, biomedical Agent Execution Environment

research, academic ideation, software engineering, synthetic

data generation, and chemical reasoning. Systems such as
Fig. 3: Core Elements of AI Agents.
the StarWhisper Telescope System [132] and HoneyComb
[133] have revolutionized operational workflows by automat-
ing observational and analytical tasks in materials science. In
large language models with modular tools and utilities to
the biomedical domain, platforms like GeneAgent [134] and
build autonomous software agents. These frameworks abstract
frameworks such as PRefLexOR [135] demonstrate enhanced
complex functionalities such as natural language understand-
reliability through self-verification and iterative refinement.
ing, multi-step reasoning, and dynamic decision-making into
Moreover, innovative solutions for research ideation, exem-
reusable components that streamline prototyping, iterative
plified by SurveyX [136] and Chain-of-Ideas [137], as well
refinement, and deployment. By integrating advanced LLMs
as specialized frameworks for synthetic data generation [138]
with external tools and specialized functions, developers can
and chemical reasoning [139], collectively underscore the
create agents that process and generate language and adapt to
significant strides made in leveraging autonomous AI agents
complex workflows and diverse operational contexts [140].
for complex, real-world tasks. Table V presents an overview
of AI Agent frameworks. Fig. 3 illustrates a comprehensive AI agent framework
where each component plays a crucial role in achieving
adaptive, autonomous decision-making. An assigned task is
A. AI Agent frameworks first approached through a designated function that defines
AI agent frameworks represent a transformative paradigm the agent’s role, followed by strategy development essentially
in developing intelligent systems, combining the power of the planning phase where the agent breaks down complex
16

Fig. 4: What are Agentic Workflows?.

objectives into actionable steps. This is supported by an

iterative thinking process, driven by reasoning and guided
by prompts, which enables the agent to reflect on its ac-
tions and refine its approach. Core operational support comes
from AI query engines and utility functions that interface
with an integrated knowledge store, ensuring that both static
and real-time information are readily accessible. Ultimately,
these elements operate within an agent execution environment,
seamlessly combining planning, reasoning, and execution into Fig. 5: Agent-Driven RAG Framework.
a responsive and self-evolving system.
Agentic workflows transform traditional, rigid processes
into dynamic, adaptive systems. As illustrated in Fig. 4, these its internal knowledge store to determine whether the query
workflows begin at the user interface, where a user query is has been addressed or needs more data. When necessary, the
submitted and receives a system reply. Unlike deterministic query is decomposed into smaller, manageable sub-questions
workflows that follow fixed, unchanging rules, an agent- that are individually routed and processed through retrieval
based process involves AI agents who actively formulate a utilities [142]. These utilities fetch relevant external data,
strategy, carry out tasks using available tools, and evaluate the and the system evaluates whether the retrieved information
outcomes. This cycle, ranging from planning to execution and is applicable before producing a final output. This layered,
ultimately to assessment, where outcomes are marked as either agentic approach ensures that responses are accurate, context-
satisfactory or unsatisfactory, empowers the system to respond aware, and continuously refined throughout the process [143].
to real-world challenges more flexibly and autonomously Tab. VI demonstrates that retrieval-augmented generation
[141]. (RAG) is highly effective at producing up-to-date, accurate
Agentic Retrieval-Augmented Generation (RAG) integrates responses, making it ideal for fields like healthcare or law,
a language model’s advanced capabilities with dynamic data where precise, domain-specific information is critical. In con-
retrieval and processing. As shown in Fig. 5, the process trast, AI Agents distinguish themselves with their continu-
begins at the user interface, where a query is submitted ous learning and autonomous decision-making capabilities,
and a system reply is generated. The system first checks which make them adaptable to evolving contexts. When these
17

TABLE VI: Comparative Analysis of LLM Strategies in RAG, AI Agents, and Agentic RAG
Feature LLM Pre-trained LLM Post Training & RAG AI Agents Agentic RAG
Fine Tuning

Core Function Uses LLM for text Applies task-specific Retrieves data and Automates tasks and Integrates retrieval with
generation. tuning. generates text. decisions. adaptive reasoning.
Autonomy Basic language Enhances autonomy Limited; user-driven. Moderately autonomous. Highly autonomous.
understanding. through tuning.
Learning Relies on pre-training. Uses fine tuning for Static pre-trained Incorporates user Adapts using real-time
precision. knowledge. feedback. data.
Use Cases General applications. Domain-specific Q&A, summaries, Chatbots, automation, Complex
enhancements. guidance. workflow. decision-making tasks.
Complexity Provides baseline Adds refined Simple integration. More sophisticated. Highly complex.
complexity. capabilities.
Reliability Depends on static Improves consistency Consistent for known May vary with dynamic Reliability boosted by
training data. with updates. queries. inputs. adaptive methods.
Scalability Scales with model size. Scales with Easily scalable for static Scales moderately with Scalable for complex
domain-specific tuning. tasks. added features. systems (with extra
resources).
Integration Easily integrable with Requires domain Integrates well with Connects with Supports advanced
various apps. customization. retrieval systems. operational workflows. decision frameworks.

illustrates the architecture of a LangChain-powered scheduling

Tools
agent that processes email requests to perform calendar-related
operations [144]. Incoming emails are first parsed to extract
- initiateBooking

Can you book a meeting with

- removeBooking
- checkAvailability
relevant content and convert unstructured text into structured
Email A ferrag.mohamedamine@univ-
guelma.dz sometime tomorrow?
- retrieveBookings
- dispatchBookingLink
data. This data is then passed to the chat model, guided
- modifyBooking
by a contextual prompt that defines the assistant’s role. The
agent uses a scratchpad to reason through the request and
Email B determine the appropriate tool from a predefined set (such as
Tool

Agent Chat Model checkAvailability

checkAvailability, initiateBooking, or modifyBooking). These
tools interact with the backend booking API to execute the
requested actions, enabling seamless AI-driven scheduling.
You are a bleeding-edge
scheduling assistant that
2) LlamaIndex: The LlamaIndex framework [125] provides
interfaces via email...etc.
-....
API
a powerful and flexible platform for building autonomous AI
Email Parser -...
for Bookings
agents by seamlessly integrating large language models with
external tools. In this framework, a basic AI agent is defined
Convert unstructured email
content into structured data as a semi-autonomous software component that receives a task
for easier processing and Scratchpad
automation. Prompting and a set of tools ranging from simple Python functions to
complete query engines and iteratively selects the appropriate
tool to process each step of the task. To build such an
agent, developers first set up a clean Python environment
Fig. 6: Agent architecture using Langchain framework. and install LlamaIndex along with necessary dependencies,
then configure an LLM (for example, GPT-4 via an API
key). Next, they wrap simple functions (such as addition and
two approaches are combined into Agentic RAG, the model multiplication) into FunctionTool objects that the agent can
benefits from RAG’s fact-based grounding and AI Agents’ call, and instantiate a ReActAgent with these tools. When
dynamic adaptability, resulting in a system that minimizes prompted with a task, the agent evaluates its reasoning process,
errors and remains current by leveraging the best aspects of chooses a tool to execute the necessary operations, and loops
each methodology. through these steps until the final answer is generated. This
1) LangChain: LangChain [124] is a robust framework structured yet dynamic approach allows for the creation of
designed to simplify the development of autonomous AI customizable, agentic workflows capable of tackling complex
agents by seamlessly integrating large language models with a tasks.
diverse array of tools and data sources. In LangChain, agents 3) CrewAI: CrewAI [126] is a framework designed to
combine prepackaged components, such as conversational orchestrate autonomous teams of AI agents, each with spe-
large language models (LLMs), search engine integrations, cialized roles, tools, and objectives, to collaboratively tackle
and specialized utility functions, into coherent workflows that complex tasks. The system is organized around four key
enable multi-step reasoning and decision-making. Developers components: the Crew, which oversees the overall operation
can build custom agents by defining specific roles, tasks, and workflow; AI Agents, which serve as specialized team
and tools, allowing the agent to analyze a given prompt, members such as researchers, writers, and analysts that make
select the appropriate tool for each subtask, and iteratively autonomous decisions and delegate tasks; the Process, which
refine its response until a final answer is produced. Fig. 6 manages collaboration patterns and task assignments to ensure
18

efficient execution; and Tasks, which are individual assign- of LLMs to control computers via GUI, while also identifying
ments with clear objectives that contribute to a larger goal. the need for more comprehensive, multimodal datasets to
Key features of CrewAI include role-based agent specializa- capture real-world complexities.
tion, flexible integration of custom tools and APIs, intelligent The paper by Sun et al. [145] tackles a major challenge
collaboration that mimics natural human interaction, and ro- in training GUI agents powered by Vision-Language Models
bust task management supporting both sequential and parallel (VLMs): collecting high-quality trajectory data. Traditional
workflows. Together, these elements enable the creation of methods relying on human supervision or synthetic data
dynamic, production-ready AI teams capable of achieving generation via pre-defined tasks are either resource-intensive
sophisticated, multi-step objectives in real-world applications. or fail to capture the complexity and diversity of real-world
4) Swarm: Swarm [127] is a lightweight, experimental environments. The authors propose OS-Genesis, a novel data
library from OpenAI designed to build and manage multi- synthesis pipeline that reverses the conventional trajectory
agent systems without relying on the Assistants API. Swarm collection process to overcome these limitations. Rather than
provides a stateless abstraction that orchestrates a continu- starting with fixed tasks, OS-Genesis enables agents to explore
ous loop of agent interactions, function calls, and dynamic environments through step-by-step interactions and then derive
handoffs, offering fine-grained control and transparency. Key high-quality tasks retrospectively, with a trajectory reward
features include: model ensuring data quality.
6) Agentic Reasoning: Wu et al. [129] presents a novel
• Agent Definition: Developers can define multiple agents, framework that significantly enhances the reasoning capa-
each equipped with its own set of instructions, designated bilities of large language models by integrating external
role (e.g., ”Sales Agent”), and available functions, which tool-using agents into the inference process. The approach
are converted into standardized JSON structures. leverages three key agents: a web-search agent for real-time
• Dynamic Handoffs: Agents can transfer control to one an- retrieval of pertinent information, a coding agent for executing
other based on the conversation flow or specific function computational tasks, and a Mind Map agent that constructs
criteria, simply by returning the next agent to call. structured knowledge graphs to track and organize logical
• Context Management: Context variables are used to ini- relationships during reasoning. By dynamically engaging these
tialize and update state throughout the conversation, en- specialized agents, the framework enables LLMs to perform
suring continuity and effective information sharing across multi-step, expert-level problem solving and deep research,
agents. addressing limitations in conventional internal reasoning ap-
• Client Orchestration: The Client.run() function initiates proaches. Evaluations on challenging benchmarks such as
and manages the multi-agent dialogue by taking an initial the GPQA dataset and domain-specific deep research tasks
agent, user messages, and context, and then returning demonstrate that Agentic Reasoning substantially outperforms
updated messages, context variables, and the last active traditional retrieval-augmented generation systems and closed-
agent. source models, highlighting its potential for improved knowl-
• Direct Function Calling & Streaming: Swarm supports edge synthesis, test-time scalability, and structured problem-
direct Python function calls within agents and provides solving.
streaming responses for real-time interactions. OctoTools [130] is a robust, training-free, and user-friendly
• Flexibility: The framework is designed to be agnostic to framework designed to empower large language models to
the underlying OpenAI client, working seamlessly with tackle complex reasoning tasks across diverse domains. By
tools such as Hugging Face TGI or vLLM hosted models. integrating standardized tool cards that encapsulate various
5) GUI Agent: Hu et al. [128] introduced Claude 3.5 tool functionalities, a planner for orchestrating both high-level
Computer Use, marking a significant milestone as the first and low-level strategies, and an executor for effective tool us-
frontier AI model to offer computer control via a graphical age, OctoTools overcomes the limitations of prior methods that
user interface in a public beta setting. The study assembles a were confined to specialized domains or required extra training
diverse set of tasks, ranging from web search and productivity data. Validated across 16 varied tasks including MathVista,
workflows to gaming and file management, to rigorously MMLU-Pro, MedQA, and GAIA-Text OctoTools achieves an
evaluate the model’s ability to translate natural language average accuracy improvement of 9.3% over GPT-4o and
instructions and screenshots into precise desktop actions, such outperforms frameworks like AutoGen, GPT-Functions, and
as cursor movements, clicks, and keystrokes. The evaluation LangChain by up to 10.6% when using the same toolset.
framework not only demonstrates Claude 3.5’s unprecedented Comprehensive analysis and ablation studies demonstrate its
end-to-end performance, with a success rate of 16 out of advantages in task planning, effective tool integration, and
20 test cases, but also highlights critical areas for future multi-step problem solving, positioning it as a significant
refinement, including improved planning, action execution, advancement for general-purpose, complex reasoning appli-
and self-critique capabilities. Moreover, the performance is cations.
shown to be influenced by factors like screen resolution, and 7) Agents SDK: The OpenAI Agents SDK [131] provides a
the study reveals that while the model can perform a wide comprehensive framework for building autonomous, multi-step
range of operations, it still struggles with replicating subtle agent applications that harness the power of large language
human-like behaviors such as natural scrolling and browsing. models alongside external tools. This SDK abstracts the core
Overall, this preliminary exploration underscores the potential components necessary for agentic workflows, including agents
19

themselves which are LLMs configured with instructions, cardiologists. Designed to address the limitations of general-
tools, handoffs, and guardrails as well as the tools that enable purpose large language models (LLMs) in clinical settings,
these agents to perform external actions (such as API calls ZODIAC leverages a multi-agent collaboration architecture to
or computations). It also supports context management to process patient data across multiple modalities. Each agent
maintain state over multi-turn interactions, structured output is fine-tuned using real-world patient data adjudicated by
types for reliable data exchange, and advanced features like cardiologists, ensuring the system’s diagnostic outputs, such as
streaming, tracing, and guardrails to ensure safety and debuga- the extraction of clinically relevant characteristics, arrhythmia
bility. detection, and preliminary report generation, are accurate
and reliable. Rigorous clinical validation, conducted by in-
dependent cardiologists and evaluated across eight metrics
B. AI Agent applications
addressing clinical effectiveness and security, demonstrates
AI Agents are autonomous systems that combine large that ZODIAC outperforms industry-leading models, including
language models (LLMs), data retrieval mechanisms, and GPT-4o, Llama-3.1-405B, Gemini-pro, and even specialized
decision-making pipelines to tackle a wide array of tasks medical LLMs like BioGPT. Notably, the successful inte-
across industries. In healthcare, they assist with clinical di- gration of ZODIAC into electrocardiography (ECG) devices
agnosis and personalized treatment planning; in finance, they underscores its potential to transform healthcare delivery,
support forecasting and risk analysis; in scientific research, exemplifying the emerging trend of embedding LLMs within
they automate literature review and experimental design; and Software-as-Medical-Device (SaMD) solutions.
in software engineering, they generate, analyze, and repair Wang et al. [148] introduce MedAgent-Pro, an evidence-
code. Using domain-specific fine-tuning and structured data based, agentic system designed to enhance multi-modal med-
sources, AI agents can also drive the generation of syn- ical diagnosis by addressing key limitations of current Multi-
thetic data, facilitate chemical reasoning, support mathematical modal Large Language Models (MLLMs). While MLLMs
problem-solving, and enable creative multimedia production, have demonstrated strong reasoning and task-performing capa-
thereby expanding the reach of AI-powered automation and bilities, they often struggle with detailed visual perception and
insight generation. Fig. 7 presents both the architectural back- exhibit reasoning inconsistencies, both of which are critical in
bone and the application landscape of AI Agents. clinical settings. MedAgent-Pro employs a hierarchical work-
1) Healthcare Applications: The healthcare sector has wit- flow: at the task level, it leverages knowledge-based reasoning
nessed significant advancements through the integration of to generate reliable diagnostic plans grounded in retrieved
large language model-based agents across a wide range of clinical criteria, and at the case level, it utilizes multiple
applications. In this subsection, we present recent develop- tool agents to process multi-modal inputs and analyze diverse
ments organized into key categories, as presented in Fig. 8, in- indicators. The final diagnosis is derived from a synthesis of
cluding clinical diagnosis and decision support, mental health quantitative and qualitative evidence. Comprehensive experi-
and therapy agents, general medical assistants for workflow ments on both 2D and 3D medical diagnosis tasks demonstrate
optimization, and pharmaceutical and drug discovery agents. that MedAgent-Pro not only outperforms existing methods but
These works demonstrate how AI agents are increasingly sup- also offers enhanced reliability and interpretability, marking a
porting medical professionals, enhancing diagnostic accuracy, significant step forward in AI-assisted clinical diagnostics.
improving patient care, and accelerating research in diverse Feng et al. [150] introduce M3Builder. This novel multi-
healthcare domains. Tab. reviews AI agent applications for agent system automates machine learning workflows in the
Healthcare. medical imaging domain, a field that has traditionally needed
a) Clinical Diagnosis, Imaging & Decision Support: specialized models and tools. M3Builder is structured around
Chen et al. [146] introduce Chain-of-Diagnosis (CoD), a novel four specialized agents that collaboratively manage complex,
approach designed to enhance the interpretability of LLM- multi-step ML tasks, including automated data processing,
based medical diagnostics. By transforming the diagnostic environment configuration, self-contained auto-debugging, and
process into a transparent, step-by-step chain that mirrors a model training, all within a dedicated medical imaging ML
physician’s reasoning, CoD provides a clear reasoning path- workspace. To assess progress in this area, the authors propose
way alongside a disease confidence distribution, which aids M3Bench, a comprehensive benchmark featuring four general
in identifying critical symptoms through entropy reduction. tasks across 14 training datasets, covering five anatomies, three
This transparent methodology not only makes the diagnostic imaging modalities, and both 2D and 3D data. Evaluations
process controllable but also boosts rigor in decision-making. using seven state-of-the-art large language models as agent
Leveraging CoD, the authors developed DiagnosisGPT, an cores, such as the Claude series, GPT-4o, and DeepSeek-V3,
advanced system capable of diagnosing 9,604 diseases. Exper- demonstrate that M3Builder significantly outperforms existing
imental results demonstrate that DiagnosisGPT outperforms ML agent designs, achieving a remarkable 94.29% success rate
existing large language models (LLMs) on diagnostic bench- with Claude-3.7-Sonnet.
marks, achieving both high diagnostic accuracy and enhanced Rose et al. [151] tackles the complexities of differential
interpretability. diagnosis (DDx) by introducing the Modular Explainable DDx
Zhou et al. [147] present ZODIAC, an innovative LLM- Agent (MEDDxAgent) framework, which facilitates interac-
powered framework that elevates cardiological diagnostics tive, iterative diagnostic reasoning rather than relying on com-
to a level of professionalism comparable to that of expert plete patient profiles from the outset. Addressing limitations
20

TABLE VII: Overview of AI Agent Applications for Healthcare

Application Year Category Core Objective Workflow & Components Key Benefits/Results C W R

DiagnosisGPT 2024 Medical Enhance interpretability via a Implements CoD to yield Diagnoses 9,604 diseases;
[146] Diagnos- transparent, step-by-step chain. confidence scores and entropy outperforms existing LLMs.
tics reduction.

ZODIAC 2024 Cardiology Deliver expert-level Multi-agent LLM fine-tuned on Outperforms leading models;
[147] cardiological diagnostics. adjudicated patient data. integrated into ECG devices.

MedAgent- 2025 Medical Enhance multi-modal diagnosis Hierarchical workflow with Outperforms existing methods
Pro [148] Diagnosis by addressing visual and knowledge-based reasoning and on 2D/3D tasks with improved
reasoning gaps. multi-modal agents. reliability.

Steenstra et 2025 Therapeutic Improve counseling training LLM-powered simulated High usability and satisfaction;
al. [149] Counsel- with continuous, detailed patients with turn-by-turn enhances learning vs. traditional
ing feedback. visualizations. methods.

M3Builder 2025 Medical Automate ML workflows in Four agents manage data Achieves 94.29% success with
[150] Imaging medical imaging. processing, configuration, state-of-the-art LLM cores.
ML debugging, and training.

MEDDxAgent 2025 Differential Enable iterative, interactive Integrates a DDxDriver, history Boosts diagnostic accuracy by
[151] Diagnosis differential diagnosis. simulator, and specialized over 10% with enhanced
retrieval/diagnosis agents. explainability.

PathFinder 2025 AI- Replicate holistic WSI analysis Four agents collaboratively Outperforms state-of-the-art by
[152] assisted as done by expert pathologists. generate importance maps and 8%, exceeding average
Diagnos- diagnoses. pathologist performance by 9%.
tics

HamRaz 2025 Therapeutic Provide the first Persian PCT Combines scripted dialogues Produces more empathetic,
[153] Counsel- dataset for LLMs with culturally and adaptive LLM role-play. nuanced, and realistic
ing adapted therapy sessions. counseling interactions.

CAMI 2025 Therapeutic Automate MI-based counseling STAR framework with three Outperforms baselines in MI
[154] Counsel- with client state inference, topic LLM modules for state, topic, competency and counseling
ing exploration, and empathetic and response. realism.
response generation.

AutoCBT 2025 Therapeutic Deliver dynamic CBT via Uses single-turn agents and Generates higher-quality CBT
[155] Counsel- multi-agent routing and dynamic supervisory routing for responses vs. fixed systems.
ing supervision. tailored interventions.

PSYCHE 2025 Psychiatric Benchmark PACAs with Uses detailed psychiatric Validated for clinical
[156] Assess- simulated patient profiles and constructs and board-certified appropriateness and safety.
ment multi-turn interactions. psychiatrist evaluations.

PsyDraw 2024 Mental Analyze HTP drawings with Two-stage feature extraction and 71.03% high consistency with
[157] Health multimodal agents for early report generation; evaluated on experts; scalable screening tool.
Screening screening of LBCs. 290 submissions; pilot
deployment in schools.

EvoPatient 2024 Medical Simulate patient–doctor Iterative multi-turn consultations Improves requirement alignment
[158] Training dialogues for training via refine patient responses and by >10% and achieves higher
unsupervised LLM agents. physician questions over 200 human preference.
case simulations.

Scripted 2024 Therapeutic Constrain LLM responses via Two prompting variants execute Demonstrates reliable script
Therapy Counsel- expert-written scripts and finite 100 simulated sessions adherence and transparent
Agents ing conversational states. following deterministic decision paths.
[159] therapeutic scripts.

LIDDiA 2025 Drug Automate end-to-end drug Orchestrates LLM-driven Generates valid candidates
[160] Discov- discovery from target selection reasoning across all pipeline >70% of cases; identifies novel
ery to lead optimization. steps; evaluated on 30 targets. EGFR inhibitors.

PatentAgent 2024 PharmaceuticalStreamline patent analysis with PA-QA, PA-Img2Mol, Improves image-to-molecule
[161] Patents LLM-driven QA, PA-CoreId modules for accuracy by up to 8.37% and
image-to-molecule, and scaffold comprehensive patent insights. scaffold ID by up to 7.62%.
ID.

DrugAgent 2024 Drug Re- Accelerate drug repurposing via Combines DTI modeling, KG Improves prediction accuracy
[162] purposing multi-agent ML and knowledge extraction, and literature mining and reduces discovery time/cost.
integration. agents.

MAP [163] 2025 Inpatient Support complex inpatient Uses IPDS benchmark; +25.10% diagnostic accuracy
Decision pathways with specialized coordinated by a chief agent for vs. HuatuoGPT2-13B; +10–12%
Support triage, diagnosis, and treatment end-to-end care planning. clinical compliance over
agents. clinicians.

SynthUserEval 2025 Health Generate synthetic users for Creates structured profiles and Enables realistic,
[164] Coaching evaluating behavior-change simulates interactions with health-grounded dialogues;
agents. coaching agents. validated by expert evaluations.
C: Clinical Validation; W: Workflow Integration; R: Regulatory Compliance; : Partial; : Not Supported; : Supported.
21

Sub - AI Agent
applications Agentic AI

Mental Health, Counseling & Therapy Agents

Pharmaceutical & Drug-Related Agents Customized

LLM model Vector
Database
Agents for Astronomical Observations

Gene Set Knowledge Discovery AI Agent

LLM model
Biomedical AI Scientist Agents

Mathematical Reasoning and Problem Solving Users

....... Database
Action

AI Agent applications

Healthcare Research
Materials Science Biomedical Science Software Engineering
Applications Applications

Synthetic data Solving mathematical Geography Multimedia

Finance Applications Chemical Reasoning
generation problems Applications Applications

Fig. 7: Architecture and Application Domains of AI Agents.

in previous approaches such as evaluations on single datasets, indicate that PathFinder outperforms state-of-the-art methods
isolated component optimization, and single-attempt diag- in skin melanoma diagnosis by 8% and, notably, surpasses the
noses MEDDxAgent integrates three modular components: average performance of pathologists by 9%, establishing a new
an orchestrator (DDxDriver), a history-taking simulator, and benchmark for accurate, efficient, and interpretable AI-assisted
two specialized agents for knowledge retrieval and diagnosis diagnostics in pathology.
strategy. To ensure robust evaluation, the authors also present b) Mental Health, Counseling & Therapy Agents:
a comprehensive DDx benchmark covering respiratory, skin, Wasenmüller et al. [159] present a script-based dialog policy
and rare diseases. Their findings reveal that iterative refinement planning paradigm that enables LLM-powered conversational
significantly enhances diagnostic accuracy, with MEDDxA- agents to function as AI therapists by adhering to expert-
gent achieving over a 10% improvement across both large written therapeutic scripts and transitioning through a finite
and small LLMs while providing critical explainability in its set of conversational states. By treating the script as a deter-
reasoning process. ministic guide, the approach constrains the model’s responses
Ghezloo et al. [152] introduce Pathfinder, a novel multi- to align with a defined therapeutic framework, making decision
modal, multi-agent framework designed to replicate the holis- paths transparent for clinical evaluation and risk management.
tic diagnostic process of expert pathologists when analyz- The authors implement two variants of this paradigm, utilizing
ing whole-slide images (WSIs). Recognizing that WSIs are different prompting strategies, and generate 100 simulated
characterized by their gigapixel scale and complex structure, therapy sessions with LLM-driven patient agents. Experimen-
PathFinder employs four specialized agents a Triage Agent, tal results demonstrate that both implementations can reliably
Navigation Agent, Description Agent, and Diagnosis Agent follow the scripted policy, providing insights into their relative
that collaboratively navigate and interpret the image data. efficiency and effectiveness, and underscoring the feasibility of
The Triage Agent first determines whether a slide is benign building inspectable, rule-aligned AI therapy systems.
or risky; if deemed risky, the Navigation and Description Du et al. [158] introduce EvoPatient, a framework for gen-
Agents iteratively focus on and characterize significant re- erating simulated patients using large language models to train
gions, generating importance maps and detailed natural lan- medical personnel through multi-turn diagnostic dialogues.
guage descriptions. Finally, the Diagnosis Agent synthesizes Existing approaches focus on data retrieval accuracy or prompt
these findings to provide a comprehensive diagnostic classi- tuning, but EvoPatient emphasizes unsupervised simulation to
fication that is inherently explainable. Experimental results teach patient agents standardized presentation patterns. In this
22

system, a patient agent and doctor agents engage in iterative Yang et al. [154] present CAMI, an automated conversa-
consultations, with each dialogue cycle serving to both train tional counselor agent grounded in Motivational Interviewing
the agents and gather experience that refines patient responses (MI), a client-centered approach designed to resolve ambiva-
and physician questions. Extensive experiments across di- lence and promote behavior change. CAMI’s novel STAR
verse clinical scenarios show that EvoPatient improves re- framework integrates three LLM-powered modules client State
quirement alignment by more than 10 percent compared to inference, motivation Topic exploration, and response gEner-
state-of-the-art methods and achieves higher human preference ation to evoke “change talk” in line with MI principles. By
ratings. After evolving through 200 case simulations over a accurately inferring a client’s emotional and motivational state,
period of ten hours, the framework achieves an optimal balance exploring relevant topics, and generating empathetic, directive
between resource efficiency and performance, demonstrating responses, CAMI facilitates more effective counseling across
strong generalizability for scalable medical training. diverse populations. The authors evaluate CAMI using both
Zhang et al. [157] present PsyDraw, a multimodal LLM- automated metrics and manual assessments with simulated
driven multi-agent system designed to support mental health clients, measuring MI skill competency, state inference ac-
professionals in analyzing House-Tree-Person (HTP) drawings curacy, topic exploration proficiency, and overall counseling
for early screening of left-behind children (LBCs) in rural success. Results demonstrate that CAMI outperforms existing
China. Recognizing the acute shortage of clinicians, PsyDraw methods and exhibits counselor-like realism, while ablation
employs specialized agents for detailed feature extraction studies highlight the essential contributions of the state in-
and psychological interpretation in two stages: comprehensive ference and topic exploration modules to its superior perfor-
analysis of drawing elements and automated generation of mance.
professional reports. Evaluated on 290 primary-school HTP Steenstra et al. [149] address the challenges in therapeutic
submissions, PsyDraw achieved High Consistency with expert counseling training by proposing an innovative LLM-powered
evaluations in 71.03% of cases and Moderate Consistency system that provides continuous, detailed feedback during
in 26.21%, flagging 31.03% of children as needing further simulated patient interactions. Focusing on motivational in-
attention. Deployed in pilot schools, PsyDraw demonstrates terviewing a counseling approach emphasizing empathy and
strong potential as a scalable, preliminary screening tool that collaborative behavior change the framework features a sim-
maintains high professional standards and addresses critical ulated patient and visualizations of turn-by-turn performance
mental health gaps in resource-limited settings. to guide counselors through role-play scenarios. The system
Lee et al. [156] introduce PSYCHE, a comprehensive frame- was evaluated with both professional and student counselors,
work for benchmarking psychiatric assessment conversational who reported high usability and satisfaction, indicating that
agents (PACAs) built on large language models. Recognizing frequent and granular feedback can significantly enhance the
that psychiatric evaluations rely on nuanced, multi-turn inter- learning process compared to traditional, intermittent methods.
actions between clinicians and patients, PSYCHE simulates Abbasi et al. [153] introduce HamRaz, the first Persian-
patients using a detailed psychiatric construct that specifies language dataset tailored for Person-Centered Therapy (PCT)
their profiles, histories, and behavioral patterns. This approach with large language models (LLMs), addressing a critical
enables clinically relevant assessments, ensures ethical safety gap in culturally and linguistically appropriate mental health
checks, facilitates cost-efficient deployment, and provides resources. Recognizing that existing counseling datasets are
quantitative evaluation metrics. The framework was validated largely confined to Western and East Asian contexts, the
in a study involving ten board-certified psychiatrists who authors design HamRaz by blending scripted therapeutic dia-
reviewed and rated the simulated interactions, demonstrating logues with adaptive LLM-driven role-playing to foster coher-
PSYCHE’s ability to rigorously evaluate PACAs’ clinical ent, dynamic therapy sessions in Persian. To rigorously assess
appropriateness and safety. performance, they propose HamRazEval, a dual evaluation
Xu et al. [155] addresses the limitations of existing LLM- framework combining general dialogue quality metrics with
based Cognitive Behavioral Therapy (CBT) systems, namely the Barrett–Lennard Relationship Inventory (BLRI) to measure
their rigid agent structures and tendency toward redundant, un- therapeutic rapport and effectiveness. Experimental compar-
helpful suggestions, by proposing AutoCBT, a dynamic multi- isons demonstrate that LLMs trained on HamRaz generate
agent framework for automated psychological counseling. Ini- more empathetic, contextually nuanced, and realistic counsel-
tially, the authors develop a general single-turn consultation ing interactions than conventional Script Mode or Two-Agent
agent using Quora-like and YiXinLi models, evaluated on Mode approaches.
a bilingual dataset to benchmark response quality in single- c) General Medical Assistants, Clinical Workflow & De-
round interactions. Building on these insights, they introduce cision Making: Yun et al. [164] introduce an end-to-end
dynamic routing and supervisory mechanisms modeled af- framework for generating synthetic users to evaluate inter-
ter real-world counseling practices, enabling agents to self- active agents aimed at promoting positive behavior change,
optimize and tailor interventions more effectively. Experimen- focusing on sleep and diabetes management. The framework
tal results demonstrate that AutoCBT generates higher-quality first generates structured data based on real-world health and
CBT-oriented responses compared to fixed-structure systems, lifestyle factors, demographics, and behavioral attributes. Next,
highlighting its potential to deliver scalable, empathetic, and it creates complete user profiles conditioned on this structured
contextually appropriate psychological support for users who data. Interactions between synthetic users and health coaching
might otherwise avoid in-person therapy. agents are simulated using generative agent models such as
23

Concordia or by directly prompting a language model. Case PA-Img2Mol for converting chemical structure images into
studies with sleep and diabetes coaching agents demonstrate molecular representations, and PA-CoreId for identifying core
that the synthetic users enable realistic dialogue by accurately chemical scaffolds. PA-Img2Mol achieves accuracy gains of
reflecting users’ needs and challenges. Blinded evaluations by 2.46 to 8.37 percent across CLEF, JPO, UOB, and USPTO
human experts confirm that these health-grounded synthetic patent image benchmarks, while PA-CoreId delivers improve-
users portray real human users more faithfully than generic ments of 7.15 to 7.62 percent on the PatentNetML scaffold
synthetic users. This approach provides a scalable and realistic identification task. By combining these modules within a
testing ground for developing and refining conversational unified framework, PatentAgent addresses the full spectrum
agents in health and lifestyle coaching. of patent analysis needs, from extracting detailed experimental
Chen et al. [163] address the complexity of clinical decision- insights to pinpointing key molecular structures, and offers a
making in inpatient pathways by introducing both a new powerful tool to accelerate research and innovation in drug
benchmark and a multi-agent AI framework. The authors con- discovery.
struct the Inpatient Pathway Decision Support (IPDS) bench- Averly et al. [160] introduce LIDDiA, an autonomous in
mark from the MIMIC-IV database, comprising 51,274 cases silico agent designed to navigate the entire drug discovery
across nine triage departments, 17 disease categories, and 16 pipeline by leveraging the reasoning capabilities of large
standardized treatment options to capture the multifaceted na- language models. Unlike prior AI tools that address individual
ture of inpatient care. Building on this resource, they propose steps such as molecule generation or property prediction,
the Multi-Agent Inpatient Pathways (MAP) framework, which LIDDiA orchestrates the end-to-end process from target selec-
employs a triage agent for patient admission, a diagnosis agent tion through lead optimization. The authors evaluate LIDDiA
for department-level decision-making, and a treatment agent on 30 clinically relevant targets and show that it generates
for care planning, all coordinated by a chief agent that oversees candidate molecules satisfying key pharmaceutical criteria in
the entire pathway. In extensive experiments, MAP achieves over 70 percent of cases. Furthermore, LIDDiA demonstrates
a 25.10% improvement in diagnostic accuracy over the state- an intelligent balance between exploring novel chemical space
of-the-art LLM HuatuoGPT2-13B and surpasses three board- and exploiting known scaffolds and successfully identifies
certified clinicians in clinical compliance by 10–12%. These promising new inhibitors for the epidermal growth factor
results demonstrate the potential of multi-agent systems to receptor (EGFR), a major oncology target.
support complex inpatient workflows and lay the groundwork Inoue et al. [162] present a multi-agent framework designed
for future AI-driven decision support in hospital settings. to accelerate drug repurposing by combining machine learning
and knowledge integration. The system includes three spe-
cialized agents: an AI Agent that trains robust drug–target
PatentAgent
[161] MAP
Framework
interaction (DTI) models, a Knowledge Graph Agent that
LIDDiA
[160]
[163]
extracts DTIs from databases such as DGIdb, DrugBank, CTD
Synthetic
Users and STITCH, and a Search Agent that mines biomedical
[164]
Drug
Repurposing
literature to validate computational predictions. By integrating
[162]
outputs from these agents, the framework leverages diverse
General Medical
Pharmaceutical &
Drug-Related
Assistants,
Clinical Workflow
data sources to identify promising candidates for repurposing.
Agents
& Decision Making
Preliminary evaluations indicate that this approach not only
enhances the accuracy of drug–disease interaction predictions
compared to existing methods but also reduces the time
Healthcare HamRaz and cost associated with traditional drug discovery. The in-
Applications [153]
terpretable results and scalable architecture demonstrate the
CoD Scaffolding potential of multi-agent systems to drive innovation and effi-
[146] [149]
ciency in biomedical research.
Clinical Diagnosis, Mental Health,
Imaging &
Decision Support
Counseling &
Therapy Agents
2) Materials Science: Materials science has recently ben-
ZODIAC CAMI
[147] [154] efited from the integration of LLM-based agents, which are
helping to automate complex scientific workflows and enhance
MedAgent-Pro
[148]
AutoCBT
[155]
research efficiency. In this subsection, we highlight two no-
Script
Planning table developments, including the application of AI agents
M3Builder [159] PSYCHE
[150]
MEDDxAgent
PathFinder EvoPatient
PsyDraw
[156] in astronomical observations to streamline data collection and
[152] [158]
[151] [157]
analysis, and the creation of specialized agent systems tailored
to address the unique challenges of materials science research.
Fig. 8: Agent LLM Applications for Healthcare a) LLM-Based Agents for Astronomical Observations:
The StarWhisper Telescope System [132] leverages LLM-
d) Pharmaceutical & Drug-Related Agents: Wang et al. based agents to streamline the complex workflow of astro-
[161] introduce PatentAgent, the first end-to-end intelligent nomical observations within the Nearby Galaxy Supernovae
agent designed to streamline pharmaceutical patent analysis Survey (NGSS) project. This innovative system automates crit-
by leveraging large language models. PatentAgent integrates ical tasks including generating customized observation lists,
three core modules: PA-QA for patent question answering, initiating telescope observations, real-time image analysis, and
24

formulating follow-up proposals to reduce the operational steps, all within a thinking token framework that fosters
burden on astronomers and lower training costs. By integrating iterative feedback loops.
these agents into the observation process, the system can effi- c) Biomedical AI Scientist Agents: Lin et al. [165] intro-
ciently verify and dispatch observation lists, analyze transient duce BioKGBench, a novel benchmark designed to evaluate
phenomena in near real-time, and seamlessly communicate biomedical AI scientist agents from the perspective of litera-
results to observatory teams for subsequent scheduling. ture understanding. Unlike traditional evaluation methods that
b) Materials Science Research: HoneyComb [133] is rely solely on direct QA or biomedical experiments, BioKG-
introduced as the first LLM-based agent system tailored ex- Bench decomposes the critical ability of “understanding liter-
plicitly for materials science, addressing the unique challenges ature” into two atomic tasks: one that verifies scientific claims
posed by complex computational tasks and outdated implicit in unstructured text from research papers and another that in-
knowledge that often lead to inaccuracies and hallucinations volves interacting with structured knowledge-graph question-
in general-purpose LLMs. The system leverages a novel, high- answering (KGQA) for literature grounding. Building on these
quality materials science knowledge base (MatSciKB) curated components, the authors propose a new agent task called
from reliable literature and a sophisticated tool hub (Tool- KGCheck, which uses domain-based retrieval-augmented gen-
Hub) that employs an Inductive Tool Construction method eration to identify factual errors in large-scale knowledge
to generate, decompose, and refine specialized API tools. graph databases. With a dataset of over 2,000 examples for
Additionally, the retriever module adaptively selects the most the atomic tasks and 225 high-quality annotated samples for
relevant knowledge sources and tools for each task, ensuring the agent task, the study reveals that state-of-the-art agents
high accuracy and contextual relevance. both in everyday and biomedical settings perform poorly or
3) Biomedical Science: The biomedical field has seen suboptimally on this benchmark.
important progress through the development of LLM-based 4) Research Applications: LLM-based agents are increas-
agents designed to support knowledge discovery, enhance ingly being developed to support and automate various aspects
reasoning capabilities, and evaluate scientific literature. In of the scientific research process. This subsection presents
this subsection, we review recent contributions that focus on a selection of recent applications, including collaborative re-
gene set analysis, iterative learning for improved reasoning, search environments, automated survey generation, structured
and the evaluation of AI scientist agents through specialized literature analysis for ideation, workflow management in data
biomedical benchmarks. science, and AI-driven hypothesis generation.
a) Gene Set Knowledge Discovery: Gene set knowl- a) Collaborative Research Among LLM Agents:
edge discovery is crucial for advancing human functional Schmidgall and Moor [166] introduces AgentRxiv, a frame-
genomics, yet traditional LLM approaches often suffer from work designed to enable collaborative research among au-
issues like hallucinations. To address this, Wang et al. [134] tonomous LLM agent laboratories by leveraging a shared
introduce GeneAgent a pioneering language agent with self- preprint server. Recognizing that scientific discovery is inher-
verification capabilities that autonomously interacts with bio- ently incremental and collaborative, AgentRxiv allows agents
logical databases and leverages specialized domain knowledge to upload and retrieve research reports, thereby sharing in-
to enhance accuracy. Benchmarking on 1,106 gene sets from sights and building upon previous work in an iterative manner.
diverse sources, GeneAgent consistently outperforms standard The study demonstrates that agents with access to prior
GPT-4, and a detailed manual review confirms that its self- research achieve a significant performance boost an 11.4%
verification module effectively minimizes hallucinations and relative improvement on the MATH-500 dataset compared to
produces more reliable analytical narratives. Moreover, when those operating in isolation. Furthermore, the best-performing
applied to seven novel gene sets derived from mouse B2905 collaborative strategy generalizes to other domains with an
melanoma cell lines, expert evaluations reveal that GeneAgent average improvement of 3.3%, and when multiple agent lab-
offers novel insights into gene functions, significantly ex- oratories share their findings, overall accuracy increases by
pediting the process of knowledge discovery in functional 13.7% relative to the baseline. These findings highlight the
genomics. potential of autonomous agents to collaborate with humans,
b) Reasoning with Recursive Learning: Buehler et al. paving the way for more efficient and accelerated scientific
[135] proposed a framework, named PRefLexOR, that fuses discovery.
preference optimization with reinforcement learning concepts b) Automated Survey Generation: Liang et al. [136]
to enable language models to self-improve through iterative, developed the SurveyX platform, which leverages the excep-
multi-step reasoning. The approach employs a recursive learn- tional comprehension and knowledge capabilities of LLMs
ing strategy in which the model repeatedly revisits and refines to overcome critical limitations in automated survey gener-
intermediate reasoning steps before producing a final output, ation, including finite context windows, superficial content
both during training and inference. Initially, the model aligns discussions, and the lack of systematic evaluation frameworks.
its reasoning with accurate decision paths by optimizing the Inspired by human writing processes, SurveyX decomposes
log odds between preferred and non-preferred responses while the survey composition process into two distinct phases:
constructing a dynamic knowledge graph through question Preparation and Generation. During the preparation phase,
generation and retrieval augmentation. In a subsequent stage, the system incorporates online reference retrieval and applies
rejection sampling is employed to refine the reasoning quality a novel preprocessing method, AttributeTree, to effectively
by generating in-situ training data and masking intermediate structure the survey’s content. In the subsequent Generation
25

TABLE VIII: Overview of AI Agent Applications for Research

Agent / Tool Year Use Case Primary Aim Methodology & Key Findings & Eval. Collab. Open Sci.
Workflow Metrics Frame- Platform
work

AgentRxiv [166] 2025 Collaborative Share and build upon Upload/retrieve via +11.4% on MATH-500 AgentRxiv Preprint
Research preprints across shared preprint server MATH-500; +3.3% benchmark server sharing
autonomous LLM with iterative updates. cross-domain; +13.7%
labs. multi-lab.

SurveyX [136] 2025 Survey Automate systematic Preparation (retrieval +0.259 content Content & Bibliographic Structured
Generation literature surveys with + AttributeTree) + quality; +1.76 citation citation APIs citations
high quality. Generation precision vs. baselines. scoring
(repolishing).

CoI Agent [137] 2024 Research Structure literature Sequential Expert-comparable Idea Arena CoI Cost-efficient
Ideation into progressive idea Chain-of-Ideas + Idea idea quality at $0.50 framework ideation
chains. Arena evaluation per idea.
protocol.

Data Interpreter 2024 Data Manage end-to-end, Hierarchical Graph +25% on InfiAgent Pipeline APIs Reproducible
[167] Science dynamic DS pipelines Modeling + InfiAgent-DABench DABench workflows
Workflows robustly. Programmable Node (75.9→94.9%); ML &
Generation. MATH gains.

AI Co-Scientist 2025 Scientific Generate and refine Seven specialized +300 Elo hypothesis Elo & Multi-agent Hypothesis
[168] Discovery research hypotheses agents with Elo quality; +27% novelty novelty pipeline publication
autonomously. tournaments and scores. scoring
meta-review.
Eval. Framework: Evaluation Framework; Collab. Platform: Collaboration Platform; Open Sci.: Open Science Support.

phase, a repolishing process refines the output to enhance verifies each subproblem to boost the robustness of code
the depth and accuracy of the study generated, particularly generation. Extensive experiments demonstrate significant per-
improving content quality and citation precision. Experimental formance gains achieving up to a 25% boost on InfiAgent-
evaluations reveal that SurveyX achieves a content quality DABench (increasing accuracy from 75.9% to 94.9%), as
improvement of 0.259 and a citation quality enhancement of well as improvements on machine learning, open-ended tasks,
1.76 over existing systems, bringing its performance close to and the MATH dataset highlighting its superior capability
that of human experts across multiple evaluation dimensions. in managing evolving task dependencies and real-time data
c) Structuring Literature for Research Ideation: Li et adjustments.
al. [137] introduce the Chain-of-Ideas (CoI) agent, a novel e) Automating Scientific Discovery: Google [168] intro-
LLM-based framework for automating research ideation by duced the AI co-scientist, a multi-agent system built on Google
structuring relevant literature into a chain that mirrors the DeepMind Gemini 2.0, designed to automate scientific dis-
progressive development within a research domain. The CoI covery by generating and refining novel research hypotheses.
agent addresses the challenge posed by the exponential growth The framework comprises seven specialized agents Supervisor,
of scientific literature, which overwhelms traditional idea- Generation, Reflection, Ranking, Evolution, Proximity, and
generation methods that rely on simple prompts or expose Meta-review that collaboratively manage tasks ranging from
models to raw, unfiltered text. By organizing information in a parsing research goals to conducting simulated debates and
sequential chain, the CoI agent enables LLMs to capture cur- organizing hypotheses. For example, the system employs a
rent advancements more effectively, enhancing their ability to Ranking Agent that uses pairwise Elo tournaments, boosting
generate innovative research ideas. Complementing this frame- hypothesis quality by over 300 Elo points. At the same time,
work is the Idea Arena, an evaluation protocol that assesses the Meta-review Agent’s feedback has been shown to increase
the quality of generated ideas from multiple perspectives, hypothesis novelty scores by 27%. In practical applications,
aligning closely with the preferences of human researchers. such as drug repurposing for acute myeloid leukemia and novel
Experimental results indicate that the CoI agent outperforms target discovery for liver fibrosis, the framework demonstrates
existing methods and achieves quality comparable to human significant performance improvements, paving the way for
experts, all while maintaining a low cost approximately $0.50 AI systems that can generate and iteratively refine scientific
per candidate idea and corresponding experimental design. hypotheses with expert-level precision.
d) Managing Data Science Workflows: Hong et al. [167] 5) Software Engineering: Software engineering has be-
propose Data Interpreter, an LLM-based agent that tackles come a significant area of application for LLM-based agents,
end-to-end data science workflows by addressing challenges with innovations spanning architecture design and verification
in solving long-term, interconnected tasks and adapting to systems, adaptive control, software analytics, and multi-agent
dynamic data environments. Unlike previous methods that collaboration. This subsection presents recent developments
focus on individual tasks, Data Interpreter leverages two key across a wide range of tasks, including agent programming
modules: Hierarchical Graph Modeling, which decomposes frameworks, tutoring systems, automated environment config-
complex problems into manageable subproblems through dy- uration, usability testing, and multilingual code generation.
namic node generation and graph optimization, and Pro- Fig. 9 presents a classification of Agent LLM Applications
grammable Node Generation, which iteratively refines and for Software Engineering.
26

TABLE IX: Overview of AI Agent Applications for Software Engineering

Agent / Year SE Primary Objective Architecture & Workflow Key Outcomes & Metrics Bench. Intgr. Std.
Tool Domain

Ann Arbor 2025 Agent Treat LLMs as automata, Introduces the Ann Arbor Early experiments show
Architec- Program- enabling programming via conceptual framework and improved in-context learning.
ture [169] ming formal and natural languages. Postline platform.
Arch.

AgentGym 2025 Verification Scalable training of SWE-agents Leverages SYNGEN synthetic Achieves 51% pass rate on
[170] & Super- via SYNGEN data curation and data and Hybrid Test-time SWE-Bench Verified.
vision Hybrid Test-time Scaling. Scaling on SWE-Gym; trained
on SWE-Bench Verified.

TRAVER&DICT2025 Intelligent Trace-and-Verify workflow for Combines knowledge tracing Significant improvements in
[171] Tutoring stepwise coding guidance; with turn-by-turn verification; coding-tutoring success rates.
DICT evaluation protocol. evaluated via DICT protocol.

CURA 2025 Code Verbal Process Supervision for Integrates VPS modules with +3.65% on BigCodeBench with
[172] Reason- code understanding and LLM to guide reasoning over o3-mini.
ing reasoning. code.

DARS 2025 Performance Dynamic Action Re-Sampling Branches on execution feedback 55% pass@k and 47% pass@1
[173] Enhance- to branch inference at decision to explore alternative actions. on SWE-Bench Lite (Claude 3.5
ment points. Sonnet V2).

LocAgent 2025 Code Lo- Graph-based code representation Parses code into heterogeneous 92.7% file-level accuracy; +12%
[174] calization for multi-hop localization. graphs for reasoning over GitHub issue resolution.
dependencies.

GateLens 2025 Release NL→Relational-Algebra Automates query translation and 80% reduction in analysis time
[175] Valida- conversion and Python code optimized code for data (automotive software).
tion generation for test-data analysis. processing.

Repo2Run 2025 Env. Con- Atomic Docker setup synthesis Synthesizes and tests 86.0% success on 420 Python
[176] figuration with dual-environment rollback. Dockerfiles; isolates failures via repos; +63.9% vs. baselines.
dual environments.

UXAgent 2025 Usability LLM-agent with browser Generates qualitative insights, Accelerates UX iteration and
[177] Testing connector to simulate thousands action logs, and recordings reduces upfront user
of users. before user studies. recruitment.

SWE-Gym 2024 Training Realistic Python tasks and unit Provides executable +19% resolve rate; 32.0% on
[178] Environ- tests for SWE-agent training. environments with tests and SWE-Bench Verified; 26.0% on
ment natural language descriptions. Lite.

Qwen2.5-xCoder2025 Multi-Agent Multilingual instruction tuning Agents collaborate to generate Outperforms on multilingual
[179] Collabo- via language-specific agents and refine multilingual programming benchmarks.
ration with memory. instructions.

SyncMind 2025 Collaboration Defines and benchmarks Introduces SyncBench with 24 Exposes performance gaps and
[180] Simula- out-of-sync scenarios to k real-world instances. guides improvements.
tion improve agent coordination.

CodeSim 2025 Code Plan verification and I/O Incorporates plan verification SOTA on HumanEval, MBPP,
[181] Genera- simulation for multi-agent and internal debugging via APPS, CodeContests.
tion synthesis & debugging. input/output simulation.
Bench.: Benchmarking; Intgr.: Integration & Deployment; Std.: Standards Compliance; : Partial; : Not Supported; : Supported.

a) Agent Programming Architectures: Dong et al. [169] b) Verification & Supervision Agents: The papers by Jain
explore prompt engineering for large language models (LLMs) et al. [170] , Wang et al.[171], and Chen et al. [172] contribute
from the perspective of automata theory, arguing that LLMs to advancing the use of large language models (LLMs) for real-
can be viewed as automata. They assert that just as automata world software engineering (SWE) tasks, intelligent tutoring,
must be programmed using the languages they accept, LLMs and code generation. Jain et al. [170] introduce AgentGym, a
should similarly be programmed within the scope of both comprehensive environment for training SWE-agents, address-
natural and formal languages. This insight challenges tradi- ing challenges in scalable curation of executable environments
tional software engineering practices, which often distinguish and test-time compute scaling. Their approach leverages SYN-
between programming and natural languages. The paper in- GEN, a synthetic data curation method, and Hybrid Test-time
troduces the Ann Arbor Architecture, a conceptual framework Scaling to improve performance on the SWE-Bench Verified
designed for agent-oriented programming of language models, benchmark, achieving a state-of-the-art pass rate of 51%.
which serves as a higher-level abstraction to enhance in- Wang et al. [171] propose a novel coding tutoring framework,
context learning beyond basic token generation. The authors Trace-and-Verify (TRAVER), combining knowledge tracing
also present Postline, their agent platform, and discuss early and turn-by-turn verification to enhance tutor agents’ guid-
results from experiments conducted to train agents within this ance toward task completion. Their work introduces DICT, a
framework. holistic evaluation protocol for tutoring agents, demonstrating
significant improvements in coding tutoring success rates.
27

SyncMind Multi-Agent
significant advancement in optimizing coding agent perfor-
[180] Collab
Framework
[179] TRAVER mance, reducing the need for extensive manual intervention
CodeSim SWE-Gym & DICT
[181] [178] [171]
and improving overall efficiency.
UXAgent
[177]
d) Code Localization & Software Analytics: The works
Multi-Agent
by Chen et al. [174] and Gholamzadeh et al. [175] contribute
Collab-
oration
significant advancements in the application of Large Language
&
Simulation
Domain-Specific
SWE
Repo2Run
[176] Models (LLMs) to improve software engineering tasks, such
Agents as code localization and release validation. Chen et al. [174]
GateLens
introduce LocAgent, a framework for code localization that
[175]
utilizes graph-based representations of codebases. By parsing
code into directed heterogeneous graphs, LocAgent captures
Code
Localiza- LocAgent
the relationships between various code structures and their
tion & [174]
Software dependencies, enabling more efficient and accurate local-
Analytics
ization through multi-hop reasoning. Their approach, when
applied to real-world benchmarks, demonstrates substantial
Software
Engineering
improvements in localization accuracy, achieving up to 92.7%
on file-level localization and enhancing GitHub issue res-
olution success rates by 12%. In comparison to state-of-
Adaptive
Control & the-art models, LocAgent provides similar performance at a
Performance
Enhance- significantly lower cost. On the other hand, Gholamzadeh
ment
DARS
[173]
et al. [175] present GateLens, an LLM-based tool designed
to improve release validation in safety-critical systems like
automotive software. GateLens automates the analysis of test
Verification data by converting natural language queries into Relational
&
Supervision Algebra expressions and generating optimized Python code,
Agent Pro- Agents
gramming which significantly accelerates data processing. In industrial
Architectures CURA
(VPS)
[172] evaluations, GateLens reduced analysis time by over 80%,
demonstrating strong robustness and generalization across
TRAVER
& DICT
[171]
different query types. This tool improves decision-making in
AgentGym
Ann Arbor
Archi-
[170] safety-critical environments by automating test result analysis,
tecture Postline
[169] Platform
[169]
thereby enhancing the scalability and reliability of software
systems in automotive applications.
Fig. 9: Agent LLM Applications in Software Engineering e) Domain-Specific SWE Agents: Hu et al. [176] in-
troduce Repo2Run, a novel LLM-based agent aimed at au-
tomating the environment configuration process in software
Finally, Chen et al. present CURA, a code understanding and development. Traditional methods for setting up environments
reasoning system augmented with verbal process supervision often involve manual work or rely on fragile scripts, which
(VPS). CURA achieves a 3.65% improvement on benchmarks can lead to inefficiencies and errors. Repo2Run addresses
like BigCodeBench and demonstrates enhanced performance these challenges by fully automating the configuration of
when paired with the o3-mini model. These works collectively Docker containers for Python repositories. The key innova-
push the boundaries of LLM applications in complex software tions of Repo2Run are its atomic configuration synthesis and
engineering tasks, intelligent tutoring, and reasoning-driven a dual-environment architecture, which isolates internal and
code generation. external environments to prevent contamination from failed
c) Adaptive Control & Performance Enhancement: Ag- commands. A rollback mechanism ensures that only fully
garwal et al. [173] introduce Dynamic Action Re-Sampling executed configurations are applied, and the agent generates
(DARS), a novel approach for scaling compute during infer- executable Dockerfiles from successful configurations. Eval-
ence in coding agents, aimed at improving their decision- uated on a benchmark of 420 Python repositories with unit
making capabilities. While existing methods often rely on tests, Repo2Run achieved an impressive success rate of 86.0%,
linear trajectories or random sampling, DARS enhances agent outperforming existing baselines by 63.9%.
performance by branching out at key decision points and Lu et al. [177] developed UXAgent, a tool that uses
selecting alternative actions based on the history of previous LLM-Agent technology and a universal browser connector to
attempts and execution feedback. This enables coding agents simulate thousands of users for automated usability testing.
to recover more effectively from sub-optimal decisions, lead- It enables user experience (UX) researchers to quickly iterate
ing to faster and more efficient problem-solving. The authors on study designs by providing qualitative insights, quantitative
evaluate DARS on the SWE-Bench Lite benchmark, achieving action data, and video recordings before engaging participants.
an impressive pass@k score of 55% with Claude 3.5 Sonnet Wang et al. [171] introduce TRAVER (Trace-and-Verify),
V2 and a pass@1 rate of 47%, surpassing current state-of- a novel agent workflow that combines knowledge tracing
the-art open-source frameworks. This approach provides a estimating a student’s evolving knowledge state with turn-
28

by-turn verification to ensure effective step-by-step guidance over 100 subcategories, and iterative instruction refinement
toward task completion. Alongside TRAVER, they propose via suggester-editor pairs. This process yields a dataset of
DICT, an automatic evaluation protocol that utilizes controlled 25 million prompt-response pairs covering diverse skills such
student simulation and code generation tests to assess the as text editing, coding, creative writing, and reading compre-
performance of tutoring agents holistically. SWE-Gym [178] is hension. When applied to fine-tune a Mistral-7B model, the
introduced as the first dedicated environment for training real- resulting Orca-3 model demonstrated significant performance
world software engineering (SWE) agents, designed around improvements ranging from 19% to 54% across benchmarks
2,438 Python task instances that include complete code- like MMLU, AGIEval, GSM8K, BBH, and AlpacaEval as well
bases, executable runtime environments, unit tests, and natural as a notable reduction in hallucinations for summarization
language task descriptions. This realistic setup allows for tasks. These findings underscore the potential of automated,
training language model–based SWE agents that significantly agentic synthetic data generation to enhance model capabili-
improve performance achieving up to 19% absolute gains in ties while reducing reliance on labor-intensive data curation,
resolve rate on popular test sets like SWE-Bench Verified and positioning AgentInstruct as a promising tool for advancing
Lite. Furthermore, the authors explore inference-time scaling LLM instruction tuning.
by employing verifiers trained on agent trajectories sampled
from SWE-Gym, which, when combined with their fine- MarketSenseAI
Agentic Crews FinSphere
tuned agents, achieve state-of-the-art performance of 32.0% [190]
[189]
[188]

on SWE-Bench Verified and 26.0% on SWE-Bench Lite. Multi-Agent

Collaboration
f) Multi-Agent Collaboration & Simulation: The works [187]

by Yang et al. [179], Guo et al. [180], and Islam et al. [181]
contribute significant advancements to the application of Large Agentic Financial
Modeling Stock Analysis &
Language Models (LLMs) in code understanding, collabora- Citation-Enhanced
CSA
& Risk
Management
Evaluation
Multi-Agent
Financial QA
tive software engineering, and code generation. Yang et al. [191] [186]

[180] propose a novel multi-agent collaboration framework to Trustworthy

Conversational
Financial
Reasoning
bridge the gap between different programming languages. By Shopping Agents & QA

leveraging language-specific agents that collaborate and share Finance

knowledge, their approach enhances multilingual instruction Applications

tuning, enabling the efficient transfer of knowledge across Structured

Finance &
Strategic
Behavior in
Competitive
languages. The Qwen2.5-xCoder model demonstrates supe- Automation
Markets

rior performance in multilingual programming benchmarks, Structured

Finance
Strategic
Behavior
Automation
showcasing its potential to reduce cross-lingual gaps. Guo [182] Market
Simulation
Sequential
Investment
[185]

Decision-Making
et al. [180] introduce SyncMind, a framework that defines
the out-of-sync problem in collaborative software engineer-
ing. Through their SyncBench benchmark, which includes
over 24,000 instances of out-of-sync scenarios from real-
world codebases, they highlight performance gaps in current TwinMarket
[183]
FinCon
[184]

LLM agents and emphasize the need for better collabora-

tion and resource-awareness in AI systems. Finally, Islam Fig. 10: Agent LLM Applications in Finance
et al. [181] present CodeSim, a multi-agent code generation
framework that addresses program synthesis, coding, and 7) Finance Applications: Finance is a dynamic domain
debugging through a human-like perception approach. By where the adoption of LLM-based agents has opened new
incorporating plan verification and internal debugging via avenues for automation, simulation, analysis, and decision
input/output simulation, CodeSim achieves state-of-the-art per- support. This subsection presents recent innovations that span
formance across multiple competitive benchmarks, including structured finance automation, market simulation, investment
HumanEval, MBPP, APPS, and CodeContests. Their approach decision-making, financial reasoning, stock analysis, and risk
demonstrates the potential for further enhancement when cou- management. Fig. 10 presents a classification of Agent LLM
pled with external debuggers, advancing the effectiveness of Applications for Finance.
code generation systems. a) Structured Finance and Automation: Wan et al. [182]
6) Synthetic data generation: Mitra et al. [138] propose investigate the integration of artificial intelligence into struc-
AgentInstruct, a novel framework that leverages synthetic data tured finance, where the process of restructuring diverse as-
for post-training large language models through a process sets into securities such as MBS, ABS, and CDOs presents
termed ”Generative Teaching.” Recognizing the challenges substantial due diligence challenges. The authors demonstrate
posed by the varying quality and diversity of synthetic data and that AI, specifically large language models (LLMs), can effec-
the extensive manual curation typically required AgentInstruct tively automate the verification of information between loan
automates the creation of high-quality instructional datasets applications and bank statements. While close-sourced mod-
using a multi-agent workflow. Starting from raw unstructured els like GPT-4 achieve superior performance, open-sourced
text and source code, the framework employs successive stages alternatives such as LLAMA3 provide a more cost-effective
of content transformation, seed instruction generation across option. Furthermore, implementing dual-agent systems has
29

been shown to further increase accuracy, albeit with higher agent approach significantly boosts performance, with an av-
operational costs. erage increase of 15% for the LLaMA3-8B model and 5% for
b) Market Simulation: Yang et al. [183] introduce Twin- the LLaMA3-70B model, compared to single-agent systems.
Market, a multi-agent framework that harnesses large language Moreover, the proposed system performs comparably to and
models (LLMs) to simulate complex socio-economic systems, sometimes exceeds the capabilities of much larger single-agent
addressing longstanding challenges in modeling human behav- models such as LLaMA3.1-405B and GPT-4o-mini, although
ior. Traditional rule-based agent-based models often fall short it slightly lags behind Claude-3.5 Sonnet.
in capturing the irrational and emotionally driven aspects of f) Stock Analysis and Evaluation: Han et al. [187]
decision-making emphasized in behavioral economics. Twin- present a novel multi-agent collaboration system designed to
Market leverages the cognitive biases and dynamic emotional enhance financial analysis and investment decision-making by
responses inherent in LLMs to create more realistic simula- leveraging the collaborative potential of multiple AI agents.
tions of socio-economic interactions. The study illustrates how Moving beyond traditional single-agent models, the system
individual agent behaviors can lead to emergent phenomena features configurable agent groups with diverse collaboration
such as financial bubbles and recessions when combined structures that dynamically adapt to varying market conditions
through feedback mechanisms through experiments conducted and investment scenarios through a sub-optimal combination
in a simulated stock market environment. strategy. The study focuses on three key sub-tasks funda-
c) Sequential Investment Decision-Making: Yu et al. mentals, market sentiment, and risk analysis applied to the
[184] propose FinCon, an LLM-based multi-agent framework 2023 SEC 10-K forms of 30 companies from the Dow Jones
designed to tackle the complexities of sequential financial Index. Experimental findings reveal significant performance
investment decision-making. Recognizing that effective in- improvements with multi-agent configurations compared to
vestment requires dynamic interaction with volatile environ- single-agent approaches, demonstrating enhanced accuracy,
ments, FinCon draws inspiration from real-world investment efficiency, and adaptability.
firm structures by establishing a manager-analyst communi- In a related study, Han et al. [188] introduce FinSphere,
cation hierarchy. This design facilitates synchronized, cross- a conversational stock analysis agent designed to overcome
functional collaboration through natural language interactions two major challenges faced by current financial LLMs: their
while endowing each agent with enhanced memory capacity. A insufficient depth in stock analysis and the lack of objec-
key component is the risk-control module, which periodically tive metrics for evaluating the quality of analysis reports.
triggers a self-critiquing mechanism to update systematic The authors make three significant contributions. First, they
investment beliefs, thereby reinforcing future agent behavior present Stocksis, a dataset curated by industry experts to
and reducing unnecessary communication overhead. FinCon enhance the stock analysis capabilities of LLMs. Second,
exhibits strong generalization across various financial tasks, they propose Analyscore, a systematic evaluation framework
such as stock trading and portfolio management, and offers a that objectively assesses the quality of stock analysis reports.
promising approach to synthesizing multi-source information Third, they develop FinSphere, an AI agent that leverages
for optimized decision-making in dynamic financial markets. real-time data feeds, quantitative tools, and an instruction-
d) Strategic Behavior in Competitive Markets: Li et al. tuned LLM to generate high-quality stock analysis in response
[185] investigate the strategic behavior of large language to user queries. Experimental results indicate that FinSphere
models (LLMs) when deployed as autonomous agents in outperforms general and domain-specific LLMs and existing
multi-commodity markets within the framework of Cournot agent-based systems, even when these systems are enhanced
competition. The authors examine whether these models can with real-time data and few-shot guidance.
independently engage in anti-competitive practices, such as Fatouros et al. [189] introduce MarketSenseAI, an inno-
collusion or market division, without explicit human interven- vative framework for comprehensive stock analysis that har-
tion. Their findings reveal that LLMs can monopolize specific nesses large language models (LLMs) to integrate diverse
commodities by dynamically adjusting pricing and resource financial data sources ranging from financial news, historical
allocation strategies, thereby maximizing profitability through prices, and company fundamentals to macroeconomic indica-
self-directed strategic decisions. These results present signif- tors. Leveraging a novel architecture that combines Retrieval-
icant challenges and potential opportunities for businesses Augmented Generation with LLM agents, MarketSenseAI
incorporating AI into strategic roles and regulatory bodies processes SEC filings, earnings calls, and institutional reports
responsible for maintaining fair market competition. to enhance macroeconomic analysis. The latest advancements
e) Financial Reasoning and QA: Fatemi et al. [186] in the framework yield significant improvements in funda-
address the limitations of large language models (LLMs) in mental analysis accuracy over its previous iteration. Empirical
financial question-answering (QA) tasks that require complex evaluations on S&P 100 stocks (2023–2024) reveal cumulative
numerical reasoning. Recognizing that multi-step reasoning returns of 125.9% versus the index’s 73.5%, while tests on
is essential for extracting and processing information from S&P 500 stocks in 2024 show a 33.8% higher Sortino ratio,
tables and text, the authors propose a multi-agent framework underscoring the scalability and robustness of this LLM-driven
incorporating a critical agent to evaluate the reasoning process investment strategy.
and final answers. The framework is further enhanced with g) Agentic Financial Modeling and Risk Management:
multiple critic agents specializing in distinct aspects of the Okpala et al. [190] examine integrating large language models
answer evaluation. Experimental results show that this multi- into agentic systems within the financial services industry,
30

focusing on automating complex modeling and model risk into manageable sub-tasks and compiling them into a struc-
management (MRM) tasks. The authors introduce the concept tured memory library that can be referenced and refined in
of agentic crews, where teams of specialized agents, coordi- future queries. The framework incorporates three types of
nated by a manager, collaboratively execute distinct functions. memory and a library-enhanced reasoning component, en-
The modeling crew handles tasks such as exploratory data abling the system to improve over time through experience.
analysis, feature engineering, model selection, hyperparameter Evaluations on four SciBench chemical reasoning datasets
tuning, training, evaluation, and documentation, while the reveal that ChemAgent achieves performance gains of up to
MRM crew focuses on compliance checks, model replication, 46% with GPT-4, significantly outperforming existing methods
conceptual validation, outcome analysis, and documentation. and suggesting promising applications in fields such as drug
The effectiveness and robustness of these agentic workflows discovery and materials science.
are demonstrated through numerical examples applied to b) Materials Discovery & Design: By collaborating with
datasets in credit card fraud detection, credit card approval, materials science experts, Kumbhar et al. [193] curate a novel
and portfolio credit risk modeling, highlighting the potential dataset from recent journal publications that encapsulate real-
for autonomous decision-making in financial applications. world design goals, constraints, and methodologies. Using
h) Trustworthy Conversational Shopping Agents: Zeng this dataset, they test LLM-based agents to generate viable
et al. [191] focuses on enhancing the trustworthiness of LLM- hypotheses to achieve specified objectives under given con-
based Conversational Shopping Agents (CSAs) by addressing straints. To rigorously assess the relevance and quality of these
two key challenges: the generation of hallucinated or unsup- hypotheses, a novel scalable evaluation metric is proposed
ported claims and the lack of knowledge source attribution. To that mirrors the critical assessment process of materials scien-
combat these issues, the authors propose a production-ready tists. Together, the curated dataset, the hypothesis generation
solution that integrates a ”citation experience” through In- method, and the evaluation framework provide a promising
context Learning (ICL) and Multi-UX-Inference (MUI). This foundation for future research to accelerate materials discovery
approach enables CSAs to include citation marks linked to and design using LLM. ChemAgent is a novel framework
relevant product information without disrupting user experi- that aims to enhance chemical reasoning by leveraging large
ence features. Additionally, the work introduces automated language models through a dynamic, self-updating library.
metrics and scalable benchmarks to evaluate the grounding and
attribution capabilities of LLM responses holistically. Exper- Agent Trading
imental results on real-world data indicate that incorporating Arena
[202]

this citation generation paradigm enhances response grounding

by 13.83%, ultimately improving transparency and building
customer trust in conversational AI within the e-commerce
domain. Numerical
8) Chemical Reasoning: The domain of chemical reasoning Reasoning

poses complex challenges for large language models, includ-

ing precise information processing, task decomposition, and
integrating scientific knowledge and code. In this subsection, PACE
[201]

we highlight recent advances in developing LLM-based agents Solving Educational &

Mathematical Tutoring
for chemical reasoning and materials discovery. Problems
Applications

a) Chemical Reasoning & Information Processing: The MATHVC

[200]

paper by Cho et al. [192] addresses the challenges of deploying

MACM
large language model (LLM)–powered agents in resource- [194]

constrained environments, particularly for specialized domains Mathematical

Reasoning
& Problem Solving
and less-common languages, by introducing Tox-chat a Korean
MathLearner
chemical toxicity information agent. It presents a context- [195]

efficient architecture utilizing hierarchical section search to

reduce token consumption and a scenario-based dialogue Prompt
Sampling
MA-LoT
[199]
[196]
generation methodology that distills tool-using capabilities
Flows KG-Proofs
from larger models. Experimental evaluations reveal that the [197] [198]

fine-tuned 8B-parameter model significantly surpasses untuned

models and baseline approaches in database faithfulness and Fig. 11: Agent LLM Applications in Solving Mathematical
user preference, offering promising strategies for developing Problems
efficient, domain-specific language agents under practical con-
straints. 9) Solving mathematical problems: Mathematical problem-
Chemical reasoning tasks, which involve complex multi- solving remains a fundamental challenge for large language
step processes and require precise calculations, pose unique models due to the need for structured reasoning, formal logic,
challenges for LLMs especially in handling domain-specific and precise numerical computation. In this subsection, we
formulas and integrating code accurately. ChemAgent [139] present recent efforts to enhance the mathematical capabilities
addresses these challenges by decomposing chemical tasks of LLM-based agents through novel prompting strategies,
31

TABLE X: Overview of AI Agent Applications for Mathematical Problem Solving

Agent / Tool Year Math Task Primary Objective Architecture & Key Outcomes & Proof Val. Solver Integr. Notation Sup.
Workflow Metrics

MACM [194] 2024 Advanced Solve multi-step math Multi-Agent MATH level 5
Reasoning problems with robust Conditional Mining accuracy increase from
generalization. prompting for iterative 54.68% to 76.73% on
refinement. GPT-4 Turbo.

MathLearner 2024 Inductive Enhance LLM Retrieval module plus +20.96% global
[195] Reasoning reasoning via procedural knowledge accuracy; solves
inductive retrieval and injection in inductive 17.54% previously
application. loop. unsolved problems.

Prompt Sampling 2024 Search Combine diverse Uniform sampling 43% fewer runs for
[196] Space prompting methods to over multiple prompt MATH-hard with
Expansion expand search space strategies; fewer maximal coverage.
efficiently. inference runs.

Flows [197] 2024 Reasoning Generate detailed Collaborative LLM Significant

Trace math reasoning traces ensemble with online improvement in
online. DPO and rollouts. reasoning quality
versus direct inference.

KG-Proof Agent 2025 Proof Con- Automate Integrates concept KG 34% success on
[198] struction formalization of with LLM to structure MUSTARDSAUCE;
proofs using lemmas and steps. 2–11% improvement
knowledge graphs. over baselines.

MA-LoT [199] 2025 Theorem Synergize Multi-agent 61.07% on

Proving natural-language chain-of-thought plus MiniF2F-Test (Lean4)
reasoning with Lean4 LoT-Transfer pipeline versus 22.95% for
verification feedback. in Lean4. GPT-4.

MATHVC [200] 2024 Educational Simulate group Virtual classroom with Realistic dialog;
Modeling discussions for diverse student-agents improves modeling
mathematical and meta planning. task performance.
modeling skills.

PACE [201] 2025 Personalized Tailor math instruction Felder-Silverman Higher engagement
Tutoring to learning styles with personas plus Socratic and outcomes versus
Socratic feedback. method and tailored traditional tutors.
data.

Agent Trading 2025 Numerical Improve numeric Virtual stock game Enhanced geometric
Arena [202] Reasoning inference with visual plus analysis over reasoning; validated
data and reflection. plots and charts. on NASDAQ dataset.
Proof Val.: Proof Validation; Solver Integr.: Solver & Assistant Integration; Notation Sup.: Notation & Formalism Support: : Partial; : Not Supported; : Supported.

collaborative agent systems, theorem proving, and knowledge ing information and applying prior knowledge to new tasks,
integration. Fig. 11 presents a classification of agent LLM the framework significantly outperforms traditional chain-of-
applications for solving mathematical problems. thought approaches. Specifically, it improves global accuracy
a) Mathematical Reasoning and Problem Solving: The by 20.96% and can solve 17.54% of mathematical problems
paper by Lei et al. [194] tackles the challenge of ad- that the baseline fails to address. A key framework component
vanced mathematical problem-solving in large language mod- is its efficient retrieval method, which enables the model to
els (LLMs), where performance significantly declines despite effectively incorporate external knowledge and support math-
recent advancements like GPT-4. While methods such as Tree ematical computations based on explicit written procedures.
of Thought and Graph of Thought have been explored to Lee et al. [196] investigate the limitations of traditional
enhance logical reasoning, they face notable limitations: their single prompting methods in large language models (LLMs)
effectiveness on complex problems is limited, and the need for mathematical reasoning and explore alternative prompting
for custom prompts for each problem restricts generalizability. strategies. It experimentally demonstrates that distinct prompt-
In response, the authors introduce the Multi-Agent System ing methods each probe unique search spaces, a differentiation
for Conditional Mining (MACM) prompting method. MACM that becomes more pronounced with increased problem com-
successfully addresses intricate, multi-step mathematical chal- plexity. To capitalize on this diversity, the study introduces
lenges and exhibits robust generalization across diverse mathe- an efficient sampling process that uniformly combines outputs
matical contexts. Notably, using MACM, the accuracy of GPT- from these varied methods, thereby expanding the overall
4 Turbo on level five problems in the MATH dataset improves search space and achieving improved performance with fewer
markedly from 54.68% to 76.73%, demonstrating its potential inference runs. Notably, for the particularly challenging prob-
to elevate LLM inferential capabilities substantially. lems in the MATH-hard subset, the approach reached maximal
Xie et al. [195] present an agent framework designed to search space utilization with approximately 43% fewer runs
enhance the mathematical reasoning abilities of large lan- compared to individual methods.
guage models (LLMs) through inductive reasoning. Drawing Deng et al. [197] introduce a novel approach to enhance
inspiration from the human learning process of generaliz- the generation of detailed and accurate reasoning traces in
32

large language models (LLMs), particularly for mathemati- to individual learner characteristics. PACE leverages the Felder
cal reasoning tasks. The authors propose an online learning and Silverman learning style model to simulate distinct student
framework termed ”Flows,” where component LLMs work personas, enabling the system to tailor teaching strategies
collaboratively and iteratively, engaging in incremental output to diverse learning styles a crucial factor for enhancing en-
production to build coherent solutions. Central to the approach gagement and comprehension in mathematics. Integrating the
is online Direct Preference Optimization (DPO) with rollouts, Socratic teaching method, PACE provides instant, reflective
which generates DPO pairs for each training example and feedback that encourages deeper cognitive processing and
updates the models in real-time. By directly comparing the critical thinking. The framework also involves constructing
quality of reasoning traces produced by this method against personalized teaching datasets and training specialized mod-
those generated by standard direct model inference, the study els, which facilitate identifying and adapting each student’s
demonstrates that the proposed Flow framework significantly unique needs. Extensive evaluations using multi-aspect criteria
improves LLM performance in mathematical reasoning. demonstrate that PACE outperforms traditional methods in
Li et al. [198] introduce a novel framework that augments personalizing the educational experience and boosting student
large language models (LLMs) with knowledge graphs to motivation and learning outcomes.
improve the construction and formalization of mathematical c) Numerical Reasoning: Ma et al. [202] investigate
proofs. The proposed approach tackles persistent challenges the limitations of large language models (LLMs) in handling
in automating the identification of key mathematical concepts, dynamic and unseen numerical reasoning tasks, mainly when
understanding their relationships, and embedding them within operating on plain-text data. To address this, the authors
rigorous logical frameworks. Experimental results show sig- introduce the Agent Trading Arena a virtual numerical game
nificant performance gains, with the framework achieving up simulating complex economic systems via zero-sum stock
to a 34% success rate on the MUSTARDSAUCE dataset on portfolio investments which better reflects real-world scenarios
o1-mini and consistently outperforming baseline models by where optimal solutions are not clearly defined. Experimental
2–11% across various benchmarks. results indicate that LLMs, including GPT-4o, face challenges
Wang et al. [199] introduce MA-LoT, a novel multi-agent with algebraic reasoning in textual formats, often focusing on
framework designed for the Lean4 theorem proving that it syn- local details at the expense of broader trends. In contrast,
ergizes high-level natural language reasoning with formal lan- when LLMs are provided with visual data representations,
guage verification feedback. Unlike traditional single-agent ap- such as scatter plots or K-line charts, they exhibit significantly
proaches that either generate complete proofs or perform tree enhanced geometric reasoning capabilities. This improvement
searches, MA-LoT leverages structured interactions among is further enhanced by incorporating a reflection module that
multiple agents to maintain long-term coherence and deeper facilitates the analysis and interpretation of complex data.
insight during proof generation. The framework employs a These findings are validated using the NASDAQ Stock dataset,
novel LoT-Transfer Learning training-inference pipeline that underscoring the value of visual inputs for bolstering numer-
harnesses long chain-of-thought processes’ emergent formal ical reasoning in LLMs.
reasoning abilities. Extensive experiments demonstrate that 10) Geography Applications: Yu et al. [203] introduce
MA-LoT achieves a 61.07% accuracy on the Lean4 ver- MineAgent, a modular framework designed to enhance the
sion of the MiniF2F-Test dataset, significantly outperforming capabilities of multimodal large language models (MLLMs)
baselines such as GPT-4 (22.95%), single-agent tree search in the domain of remote-sensing mineral exploration. This
methods (50.70%), and whole-proof generation techniques field presents significant challenges, including the need for
(55.33%). These results underscore the potential of integrating domain-specific geological knowledge and the complexity of
long chain-of-thought reasoning with formal verification to reasoning across multiple remote-sensing images, which is
enhance automated theorem proving. further complicated by long-context issues. MineAgent ad-
b) Educational and Tutoring Applications: Yue et al. dresses these challenges by incorporating hierarchical judg-
[200] introduce MATHVC, a pioneering virtual classroom ing and decision-making modules to improve multi-image
powered by large language models (LLMs) designed to en- reasoning and spatial-spectral integration. In addition, the
hance students’ mathematical modeling (MM) skills through authors propose MineBench, a specialized benchmark to eval-
collaborative group discussions. Recognizing that traditional uate MLLMs on mineral exploration tasks using geological
MM practice often suffers from uneven access to qualified and hyperspectral data. Extensive experiments demonstrate
teachers and resources, the authors leverage LLMs’ capabil- the effectiveness of MineAgent, showcasing its potential to
ities to simulate diverse student characters, each embody- significantly advance the use of MLLMs in the critical area of
ing distinct math-relevant properties. To ensure that these remote-sensing mineral exploration
simulated interactions mirror authentic student discussions, Ning et al. [204] introduce an autonomous geographic
the framework incorporates three key innovations: integrating information system (GIS) agent framework that utilizes large
domain-specific MM knowledge into the simulation, defining language models (LLMs) to perform spatial analyses and
a symbolic schema to ground character behaviors, and em- cartographic tasks. A significant research gap in the field has
ploying a meta planner to guide the conversational flow. been the ability of these agents to autonomously discover
Liu et al. [201] introduce the Personalized Conversational and retrieve the necessary geospatial data. The proposed
Tutoring Agent (PACE) for mathematics instruction, address- framework addresses this by generating, executing, and de-
ing a critical gap in intelligent educational systems by adapting bugging programs to select data sources from a predefined
33

list, using source-specific handbooks that document metadata film production, music and poetry generation, drama scripting,
and retrieval details. The framework is designed in a plug- fashion assistance, and lyric composition. Fig. 12 presents a
and-play style, allowing users or automated crawlers to easily classification of agent LLM applications for Multimedia.
add new data sources by creating additional handbooks. A a) Film Automation Agents: Xu et al. [205] introduce
prototype of the agent has been developed as a QGIS plugin FilmAgent, an innovative LLM-based multi-agent collabora-
and Python program. Experimental results demonstrate its tive framework designed to automate end-to-end film pro-
capability to retrieve data from various sources, including duction within 3D virtual spaces. Virtual film production
OpenStreetMap, U.S. Census Bureau demographic data, satel- involves complex decision-making, including scriptwriting,
lite basemaps from ESRI, global digital elevation models from cinematography, and actor positioning. FilmAgent simulates
OpenTopography, weather data, and COVID-19 case data from various crew roles such as directors, screenwriters, actors,
the NYTimes GitHub. This work is one of the first efforts to and cinematographers, covering crucial stages of the film
create an autonomous GIS agent for geospatial data retrieval, production process. These stages include idea development,
marking a significant advancement in spatial data automation. where brainstormed ideas are transformed into structured
story outlines; scriptwriting, which generates dialogues and
Melody-Lyric character actions; and cinematography, which determines the
Agents
[212] camera setups for each shot. The agents collaborate iteratively,
providing feedback and revisions to verify intermediate scripts
Multi-Agent
Poetry
and reduce hallucinations. Evaluations of the generated videos
Framework on 15 ideas across four key aspects show that FilmAgent
[211]
outperforms all baselines, achieving an average score of 3.98
Lyric
Generation out of 5. Despite using the GPT-4o model, FilmAgent sur-
Agents
passes the single-agent o1, demonstrating the benefits of a
Poetry
Generation MusicAgent coordinated multi-agent system.
[210]
Agents b) Story-to-Video Production Agents: Wang et al. [206]
Music introduce AesopAgent, an Agent-driven Evolutionary Sys-
Under- tem designed for story-to-video production, leveraging the
standing &
Generation advancements in Agent and Artificial Intelligence Generated
Agents
Content (AIGC) technologies. AesopAgent integrates multiple
Symbolic
ComposerX generative capabilities within a unified framework, enabling
[209]
Music users to easily convert story proposals into scripts, images,
Composition
Agents audio, and videos. The system orchestrates the entire video
Multimedia generation workflow, ensuring that the generated content is
Applications
both rich and coherent. The system consists of two layers:
Fashion-Domain
Conver- Fashion the Horizontal Layer and the Utility Layer. The Horizontal
sational
Agents
Assis-
tant
Layer incorporates a novel RAG-based evolutionary system
Eval. that continuously optimizes the video production process by
[208]
Drama
accumulating expert knowledge and refining workflow steps,
Script such as LLM prompt optimization. The Utility Layer provides
Generation
Agents essential tools for consistent image generation, ensuring visual
coherence in terms of composition, characters, and style, while
Story-to-Video
Production
IBSEN also integrating audio and special effects.
[207]
Agents c) Drama Script Generation Agents: Han et al. [207]
Film Au- introduce IBSEN, a director-actor coordination agent frame-
tomation
Agents work designed to generate drama scripts and provide greater
control over the plot development, especially in scenarios
AesopAgent
[206]
where human players are involved. While current language
model agents excel at creating individual behaviors for char-
acters, they often struggle with maintaining consistency and
FilmAgent coherence at the storyline level. IBSEN addresses this by
[205]
introducing a director agent that writes plot outlines based on
Fig. 12: Agent LLM Applications in Multimedia user input, instructs actor agents to role-play their respective
characters, and adjusts the plot as needed to ensure that
11) Multimedia Applications: Multimedia is an emerging the narrative progresses toward the intended objective. The
frontier for LLM-based agents, where creative and interpretive framework was evaluated using a novel drama plot involving
tasks require coordination across diverse modalities, including multiple actor agents, where the interactions were guided by
text, audio, image, and video. In this subsection, we present the director agent. The results demonstrate that IBSEN is ca-
recent advancements in applying agent-based language learn- pable of generating diverse and complete drama scripts from a
ing and machine learning (LLM) systems to domains such as rough plot outline, while preserving the unique characteristics
34

TABLE XI: Overview of AI Agent Applications for Multimedia

Agent/Tool Year Domain Primary Objective Architecture & Key Outcomes & Eval. Metrics Pipeline Fmt. Compat.
Workflow Metrics Integr.

FilmAgent [205] 2025 Film Fully automate Multi-agent roles Outperforms Mean user Virtual studio Exports
Automation end-to-end 3D virtual (director, single-agent baselines score 3.98/5 pipeline MP4/WebM
film production. screenwriter, actors, with coherent video support
cinematographer) across 15 scenarios.
with iterative
feedback loops.

AesopAgent 2024 Story→Video Convert story drafts Two-layer Rich, coherent Workflow Integrates with Supports
[206] into scripts, images, RAG-evolutionary multimodal outputs convergence AIGC asset PNG, WAV,
audio, and video. workflow plus utility with continuous rate ≈ 85 % generators MP4
layer for optimization.
image/audio/effects.

IBSEN [207] 2024 Drama Generate coherent Director agent Diverse, complete Narrative Scriptwriting Plain-text
Scripts drama scripts via outlines plot; actor scripts preserving coherence ¿ toolchain script output
director–actor agents role-play and character traits. 90% (human compatible
coordination. adjust narrative. eval)

Fashion-Agent 2024 Conversational Enhance online LLM front-end 4 000-dialog dataset; Precision@5: E-commerce JSON /
[208] Retail fashion discovery connects to search & improves retrieval 78% API HTML widget
with LLM dialogue recommendation relevance by 18 %. integration
agents. backends.

ComposerX [209] 2024 Music Multi-agent symbolic Agents specialize in Coherent polyphonic Subjective MIDI pipeline Standard
Composi- music generation with melody, harmony, and pieces rated high on rating 4.2/5 plugin MIDI files
tion harmony constraints. structure using LLM musicality.
reasoning.

MusicAgent 2023 Music Orchestrate diverse Autonomous task Simplifies tool use; Task Integrates WAV, MP3,
[210] Processing music tasks via decomposition and reduces development completion FFmpeg, MIDI
unified LLM agent. tool invocation over effort by 40 %. time ↓ 40 % Librosa, Web
HF/GitHub/APIs. APIs

PoetryAgents 2024 Poetry Boost diversity & Cooperative & +3.0–3.7 pp diversity; Distinct Text pipeline UTF-8 text
[211] Generation novelty in non-cooperative agent +5.6–11.3 pp novelty. n-gram ↑ 11% integration
LLM-generated interactions on
poetry via multi-agent GPT-2/3/4.
social learning.

LyricAgents 2024 Lyric Melody-to-lyric Agents for rhyme, Listening test Alignment Singing-synth LRC / JSON
[212] Generation alignment in tonal syllable, alignment & accuracy 85 %. score 0.87 pipeline ready lyric files
languages with consistency; evaluated
multi-agent sub-tasks. via singing synth.
Eval. Metrics: Evaluation Metrics; Pipeline Integr.: Pipeline Integration; Fmt. Compat.: Format Compatibility.

of each character, showing the effectiveness of the framework LLMs have demonstrated impressive performance in STEM
in producing controlled, dynamic narrative content. domains, they often struggle with music composition, par-
d) Fashion-Domain Conversational Agents: Maroniko- ticularly when dealing with long dependencies and harmony
lakis et al. [208] focus on the potential of Large Language constraints. Even when equipped with advanced techniques
Models (LLMs) to revolutionize online fashion retail by en- like In-Context Learning and Chain-of-Thought, LLMs typi-
hancing customer experiences and improving product discov- cally generate poorly structured music. ComposerX aims to
ery through conversational agents. These LLM-powered agents address this by leveraging the reasoning abilities of LLMs
allow customers to interact naturally, refining their needs and their extensive knowledge of music history and theory. By
and receiving personalized fashion and shopping advice. For employing a multi-agent approach, the framework significantly
tasks like finding specific products, conversational agents must enhances the music composition quality of GPT-4. The results
translate customer interactions into calls to various backend show that ComposerX is capable of generating coherent,
systems, such as search engines, to display relevant product polyphonic music compositions with engaging melodies that
options. The authors emphasize the importance of evaluating follow user instructions, marking a substantial improvement in
the capabilities of LLMs in these tasks, particularly in integrat- the application of LLMs to creative music composition tasks.
ing with backend systems. However, existing evaluations are f) Music Understanding & Generation Agents: Yu et al.
often complex due to the lack of high-quality, relevant datasets [210] present MusicAgent, a system designed to streamline
that align with business needs. To address this, the authors AI-powered music processing by organizing and integrat-
developed a multilingual evaluation dataset comprising 4,000 ing diverse music-related tasks. Music processing spans a
conversations between customers and a fashion assistant on a wide range of activities, from generation tasks like timbre
large e-commerce platform. synthesis to comprehension tasks like music classification.
e) Symbolic Music Composition Agents: Deng et al. However, developers and amateurs often struggle to navigate
[209] introduce ComposerX, an agent-based symbolic music the complexity of these tasks, particularly due to the varying
generation framework designed to enhance the music compo- representations of music data and the applicability of different
sition capabilities of Large Language Models (LLMs). While models across platforms. MusicAgent addresses this challenge
35

by offering an integrated solution that simplifies the process C. AI Agents Protocols

for users. The system includes a comprehensive toolset that
Recent advances in autonomous AI systems have under-
gathers music tools from diverse sources such as Hugging
scored the importance of standardized communication proto-
Face, GitHub, and Web APIs. Additionally, it incorporates an
cols in facilitating seamless interaction among agents, tools,
autonomous workflow powered by Large Language Models
and external services. In this subsection, we present three
(LLMs), like ChatGPT, which organizes these tools and au-
prominent protocols developed between 2024 and 2025: Agent
tomatically decomposes user requests into sub-tasks, invoking
Communication Protocol (ACP), Model Context Protocol
the appropriate tools. The primary goal of MusicAgent is to
(MCP), and Agent-to-Agent Protocol (A2A).
alleviate users from the technicalities of using AI-based music
tools, allowing them to focus on the creative aspects of music. 1) Agent Communication Protocol (ACP): In 2025, IBM
Research proposed the agent-to-agent communication protocol
g) Poetry Generation Agents: Zhang et al. [211] intro- named ACP, which is central to the operation of BeeAI1 ,
duces a framework for enhancing the diversity and novelty of an experimental platform designed to streamline the orches-
poetry generated by Large Language Models (LLMs) by in- tration and execution of open-source AI agents, regardless
corporating social learning principles. While LLMs have made of their underlying framework or code base. The primary
significant strides in automatic poetry generation, their outputs goal of ACP is to standardize communication between agents,
often lack the diversity and creativity seen in human-generated addressing challenges posed by inconsistent interfaces and
poetry. The proposed framework emphasizes both cooperative enabling seamless interaction across diverse agents and client
and non-cooperative interactions among multiple agents to systems. Inspired by Anthropic’s MCP, ACP initially aimed
foster diversity in generated poetry. This is the first attempt to to connect agents to data and tools but has since evolved to
apply multi-agent systems in non-cooperative environments for include advanced features such as discovery, delegation, and
poetry generation, utilizing both TRAINING-BASED agents multi-agent orchestration. Core components of BeeAI include
(GPT-2) and PROMPTING-BASED agents (GPT-3 and GPT- the BeeAI Server, which orchestrates agent processes in a
4). Experiments based on 96k generated poems show sig- local-first environment and provides a unified REST endpoint
nificant improvements, particularly for TRAINING-BASED for external apps and UIs, and the ACP SDKs, which offer
agents, with a 3.0–3.7 percentage point increase in diver- libraries in Python and TypeScript, along with a command-line
sity and a 5.6–11.3 percentage point increase in novelty, as interface and UI for easy agent discovery and launch [214].
measured by distinct and novel n-grams. The results also 2) Model Context Protocol (MCP): In late 2024, Anthropic
reveal that poetry generated by these agents shows increased introduced the Model Context Protocol (MCP), an open and
divergence in terms of lexicons, styles, and semantics. For flexible protocol that standardizes how AI systems interact
PROMPTING-BASED agents, the non-cooperative environ- with external tools and data sources, much like a USB-C
ment helps enhance diversity, with an increase of 7.0–17.5 port provides a universal connection for devices. Inspired by
percentage points, though these agents showed a decrease in the Language Server Protocol, MCP enables AI agents to
lexical diversity over time and did not exhibit the desired autonomously identify, select, and manage a wide range of ser-
group-based divergence. vices based on the specific context of each task. The protocol
h) Lyric Generation Agents: Liu et al. [212] address facilitates the development of complex workflows by offering
the challenges of melody-to-lyric generation by leveraging a growing catalog of pre-built integrations, the flexibility to
Generative Large Language Models (LLMs) and multi-agent switch between different LLM providers, and best practices
systems. Previous research in this area has been constrained for securing data within an organization’s infrastructure [216].
by limited high-quality aligned data and unclear standards for An expanding ecosystem of servers highlights the protocol’s
creativity. Many studies focused on broad themes or emotions, potential. For example, official reference servers demonstrate
which have limited value given the advanced capabilities MCP’s core capabilities through secure file management and
of current language models. In tonal contour languages like database access, utilizing PostgreSQL, SQLite, and Google
Mandarin, where pitch contours are influenced by both melody Drive. At the same time, development environments benefit
and tone, achieving a good fit between lyrics and melody from integration with tools such as Git, GitHub, and GitLab.
becomes more complex. The study, validated by the Mpop600 Moreover, MCP supports productivity and communication
dataset, demonstrates that lyricists and melody writers care- enhancements via integrations with platforms like Slack and
fully consider this fit during their composition process. To Google Maps and even extends to specialized AI tools, includ-
tackle this, the authors developed a multi-agent system that de- ing image generators and sophisticated search APIs2 .
composes the melody-to-lyric task into specific sub-tasks, with MCP is designed around a client-server architecture in
individual agents managing aspects such as rhyme, syllable which host applications connect to multiple lightweight servers
count, lyric-melody alignment, and consistency. The quality of [213]. This allows secure access to local data sources such as
the generated lyrics was evaluated through listening tests using files and databases and remote services available through web
a diffusion-based singing voice synthesizer, assessing how APIs. By unifying these interfaces, MCP transforms everyday
different agent groups performed in terms of lyric creation. platforms into versatile, multi-modal AI agents, simplifying
This work introduces a more structured approach to melody-
to-lyric generation, offering a deeper understanding of the 1 https://github.com/i-am-bee/beeai-framework

interaction between melody and lyrics in tonal languages. 2 https://github.com/modelcontextprotocol/servers

Allow a diverse selection of MCP Enabling dynamic, multimodal interactions among Enable agents to interface with tools, APIs,
servers to be integrated with various agents without requiring shared memory, and resources using standardized structured
agents. Agent A resources, or tools. Agent B inputs and outputs.
MCP Host MCP Host

A2A Server
A2A Client
Large Language Model (e.g.,
DeepSeek, Qwen, ...etc.)
Remote CrewAI
A2A protocol
MCP MCP MCP Agent Agent MCP MCP MCP

OpenRouter API
Server protocol Client Client protocol Server

Local Data Local Data

Source 1 Source 3

A2A Server
A2A Client
Remote A2A protocol LangChain
Agent Agent

MCP MCP MCP MCP MCP MCP

Server protocol Client Client protocol Server

A2A Server
Local Data

A2A Client
Local Data Remote Haystack Source 4

Agent Development Kit

Source 2 Agent A2A protocol Agent
Agent Framework

MCP MCP MCP MCP MCP MCP

A2A Server
A2A Client
Server protocol Client Client protocol Server
Microsoft
Remote A2A protocol AutoGen
Agent Remote
Agent
Remote Service
Service

Front-End Front-End

Web Browser - User Web Browser - User

Fig. 13: Multi-Agent Integration Framework: Enabling dynamic collaboration through the A2A and MCP Protocols.

the creation of AI-native applications and accelerating inno- crosoft AutoGen, which communicate via the A2A protocol.
vation across diverse domains. This communication method allows agents to collaborate
3) Agent-to-Agent Protocol (A2A): In 2025, Google intro- dynamically without sharing internal memories, resources, or
duced the Agent2Agent (A2A) protocol to usher in a new tools, ensuring secure and efficient inter-agent exchanges. In
era of seamless interoperability among AI agents, significantly parallel, the framework utilizes the MCP protocol to stan-
enhancing workplace productivity and automation [215]. The dardize interactions with various tools, APIs, and resources,
protocol is designed to facilitate dynamic collaboration be- enabling agents to connect with both local data sources and
tween autonomous agents, enabling them to work together remote services through structured inputs and outputs.
across isolated data systems and diverse applications regard- Tab. XII provides a comparative analysis of three agent
less of their underlying frameworks or vendors. Using familiar communication protocols: MCP, ACP, and A2A. It highlights
standards such as HTTP, SSE, and JSON-RPC, A2A simplifies their primary purpose, typical setup, core features, and ideal
integration with existing IT infrastructures while also ensuring use cases. MCP (Model Context Protocol) focuses on in-
robust enterprise-grade security through proven authentication tegrating data and tools into LLM workflows, providing a
and authorization practices. A2A supports both swift and standardized interface for delivering context. ACP (Agent
long-duration tasks by allowing agents to exchange real-time Communication Protocol), a component of the BeeAI plat-
updates, negotiate user interface requirements, and perform form, enables communication among multiple agents in a
capability discovery via structured ”Agent Cards. local-first setup, providing tools for agent discovery and
MCP is designed to connect agents with tools, APIs, and telemetry. In contrast, A2A (Agent-to-Agent Protocol) enables
resources through structured inputs and outputs. It is fully interoperability between agents across different frameworks,
supported by Google’s ADK, which enables a wide range of allowing them to exchange tasks and collaborate. The table
MCP servers to be seamlessly integrated with AI agents. In highlights the distinct roles these protocols play in agent-
parallel, A2A 3 provides a dynamic, multimodal framework based systems, with MCP focusing on data integration for
for agent-to-agent communication, allowing different agents to LLMs, ACP concentrating on local agent orchestration, and
collaborate without sharing memory, resources, or tools. Fig. A2A facilitating cross-platform collaboration among agents.
13 presents a sophisticated multi-agent integration framework
that leverages two key protocols A2A and MCP to enable D. Training datasets
seamless interactions among diverse agents and services. It High-quality training datasets are crucial for enhancing
depicts multiple remote agents, including those branded as the reasoning, multilingual understanding, and instruction-
CrewAI Agent, LangChain Agent, Haystack Agent, and Mi- following abilities of large language models. In this subsection,
we present three recently developed datasets: NaturalReason-
3 https://google.github.io/A2A/ ing, FineWeb2, and MagPie-Ultra. Each dataset addresses
37

TABLE XII: Comparison of MCP, ACP, and A2A Protocols

Feature MCP (Model Context Protocol) ACP (Agent Communication A2A (Agent-to-Agent
[213] Protocol)[214] Protocol)[215]
Main Purpose Facilitates access to context and Enables communication between multiple Facilitates communication and
data for LLMs agents within BeeAI task-sharing between agents across
frameworks
Common Setup Distributed servers providing spe- BeeAI Server coordinates and manages Agents from different frameworks
cific data, connected via an MCP multiple agents within a local environment discover and connect through
hub HTTP interfaces
Key Capabilities Standardized interface for connect- Simplifies agent deployment, discovery, and Allows agents to discover each
ing data and services to LLMs offers deep telemetry within BeeAI other’s capabilities and share tasks
with updates
Typical Application Managing context for LLMs and Managing multiple agents within BeeAI’s Enabling interaction and collabora-
integrating data streams environment tion between agents from diverse
systems
Core Objective Uniformly managing how LLMs Standardizing communication between Creating a standardized method for
receive context and external tools BeeAI agents and external systems agents from different systems to
communicate and collaborate
Architecture Client-server model where LLMs BeeAI Server orchestrates the interaction of Agents connect through agent
hook into servers for data and tools local agents and integrates external frame- cards and HTTP for task execution
works and communication
Key Differences Focuses on integrating tools and Primarily focused on internal coordination Aims at linking agents across dif-
data into a single LLM process of agents within BeeAI ferent ecosystems to collaborate ef-
fectively
Ideal Usage Scenario Integrating multiple data sources or Running and managing various agents Connecting agents from different
services into an LLM workflow within BeeAI’s environment platforms to enable collaboration
and task-sharing
Common Use Cases Implementing controlled, secure Orchestrating multi-agent environments Enabling task sharing and agent
LLM workflows with external data with BeeAI’s platform communication across different
vendor systems

different aspects of model training, ranging from expanding configurations, FineWeb2 employs innovative techniques such
reasoning across multiple domains to enhancing multilingual as ”re-hydration” deduplication and language-specific filtering
capabilities and advancing the generation of synthetic instruc- to ensure high data quality. Extensive ablation experiments,
tions. conducted with a 1.45 billion-parameter model trained on
1) NaturalReasoning dataset: Scaling reasoning capabili- 30 billion tokens, further validate the dataset’s robustness.
ties beyond traditional domains such as math and coding has In comparative evaluations against established datasets like
been challenging due to the scarcity of diverse, high-quality CC-100, mC4, CulturaX, and HPLT, FineWeb2 consistently
questions. In response, [217] introduces NaturalReasoning a outperforms across diverse languages. Additionally, special-
comprehensive dataset comprising 2.8 million questions that ized evaluations using the FineTasks benchmark on 9 varied
span multiple domains, including STEM fields (like Physics languages underscore its potential for advancing multilingual
and Computer Science), Economics, and Social Sciences, com- natural language processing and retrieval-augmented genera-
plete with reference answers. The dataset is designed not only tion applications.
to serve as a resource for knowledge distillation experiments, 3) MagPie-Ultra dataset: MagPie-Ultra [219] is a synthetic
where it effectively transfers reasoning capabilities from a dataset generated using Meta Llama 3.1 405 B-Instruct FP8,
strong teacher model, but also for unsupervised self-training representing the first open dataset of its kind. It comprises
using external reward models. When training the Llama3.1- 50,000 synthetic instruction pairs, created by prompting the
8B-Instruct model, NaturalReasoning demonstrates superior language model with minimal ”empty” prompts (only initial
scaling effects, achieving notably higher average performance special tokens) that allow it to generate both user queries and
on benchmarks such as MATH, GPQA, and MMLU-Pro corresponding responses auto-regressively. These generated
compared to other datasets. This work highlights the potential pairs, filtered according to the MagPie recipe and refined via
of a large, diverse question dataset to expand the boundaries Argilla distilabel, cover a diverse range of challenging tasks,
of LLM reasoning across a broader range of fields. including coding, mathematics, data analysis, creative writing,
2) FineWeb2 dataset: Hugging Face’s team introduced advice seeking, and brainstorming. In addition to the raw
[218] FineWeb2, a groundbreaking multilingual dataset com- instruction pairs, the dataset includes detailed metadata quality
prising 8TB of meticulously cleaned text data with over and difficulty scores, embeddings, topic labels, and safety as-
3 trillion non-English words drawn from more than 1,000 sessments from tools like ArmorRM and LlamaGuard, which
languages. FineWeb2 supports a total of 1,893 languages, with further support its use in training and evaluating large language
substantial coverage 486 languages include more than 1MB models across complex instruction-following scenarios.
of data and 80 languages boast over 1GB each demonstrating
its extensive linguistic diversity. Built upon 96 snapshots of V. C HALLENGES AND O PEN PROBLEMS
CommonCrawl data spanning 2013 to 2024 and processed As the capabilities of AI agents and large language models
using the ”datatrove” alongside sophisticated filtering code and continue to grow, new challenges and open problems emerge
38

that limit their effectiveness, reliability, and security [220]. In questions, background surveys, inspirations, and hypotheses,
this section, we highlight several critical research directions, across 12 disciplines. Expert validation ensures the reliability
including advancing the reasoning abilities of AI agents, of this framework. By exclusively using papers published in
understanding the failure modes of multi-agent systems, sup- 2024, the study minimizes data contamination from large lan-
porting automated scientific discovery, enabling dynamic tool guage model (LLM) pretraining datasets, revealing that LLMs
integration, reinforcing autonomous search capabilities, and perform notably well in retrieving novel inspirations. This
addressing the vulnerabilities of emerging communication positions LLMs as promising “research hypothesis mines”
protocols. that can facilitate the automation of scientific discovery by
generating innovative hypotheses at scale.
A. AI Agents Reasoning Despite these advances, significant challenges remain for AI
The primary challenge addressed in [221] is the inherent agents employing LLMs to automate scientific discovery. One
limitation of traditional Chain-of-Thought (CoT) methods, key obstacle is ensuring that these agents generate novel and
which only reveal the final reasoning steps without explicitly scientifically valid hypotheses, as they must navigate the risk
modeling the underlying cognitive process that leads to those of producing biased or spurious associations without sufficient
steps. Meta Chain-of-Thought (Meta-CoT) aims to fill this human oversight. Furthermore, the complexity and diversity
gap by capturing and formalizing the latent reasoning that of scientific literature across various disciplines demand that
underlies a Chain-of-Thought (CoT). This involves not only these agents not only understand domain-specific nuances but
generating the visible chain of thought but also understanding also adapt dynamically to evolving research contexts. The risk
the in-context search behavior and iterative reasoning steps of data contamination, particularly when recent publications
that contribute to it. To overcome these challenges, the authors might overlap with pretraining data, further complicates the
explore innovative approaches, including process supervision, extraction of truly innovative insights. In addition, scaling
synthetic data generation, and search algorithms, to produce these systems while preserving transparency, interpretability,
robust Meta-CoTs. Moreover, they propose a concrete train- and ethical standards poses a multifaceted challenge that must
ing pipeline that integrates instruction tuning with linearized be addressed to harness the potential of AI-driven scientific
search traces and reinforcement learning post-training. Open discovery fully.
research questions remain regarding scaling laws, the role of
verifiers, and the discovery of novel reasoning algorithms,
underscoring the complexity and potential of advancing more
human-like reasoning in large language models. D. Dynamic Tool Integration for Autonomous AI Agents

B. Why Do Multi-Agent LLM Systems Fail? Wu et al. [224] introduce Chain-of-Tools, a novel tool
Pan et al. [222] present a critical examination of why learning approach that leverages the robust semantic represen-
multi-agent LLM systems, despite the theoretical benefits of tation capabilities of frozen large language models (LLMs) to
collaboration, continue to underperform compared to their perform tool calling as part of a chain-of-thought reasoning
single-agent counterparts. Through a rigorous study of five process. By utilizing a vast and flexible tool pool that can
open-source frameworks across 150 tasks, the authors enlist include previously unseen tools, this method addresses the
expert human annotators to identify fourteen distinct failure inefficiencies and highlights key challenges, including man-
modes ranging from ignoring task or role specifications and aging vast prompt-based demonstrations. The authors validate
unnecessary repetition, to lapses in memory and flawed verifi- their approach on a range of datasets, including a newly
cation processes. These issues are systematically grouped into constructed dataset, SimpleToolQuestions, as well as GSM8K-
three categories: design and specification shortcomings, inter- XL, FuncQA, and KAMEL, demonstrating that Chain-of-
agent misalignment, and challenges in task verification and Tools outperforms conventional baselines. Additionally, the
termination. Moreover, the study explores interventions such method holds promise for enhancing autonomous AI agents by
as refining agent role definitions and orchestration strategies, enabling them to select and utilize external tools dynamically,
but finds that these measures alone are insufficient; thereby, thereby broadening their capability to solve complex, multi-
it outlines a clear roadmap for future research to address the step tasks independently. This work prompts several questions:
intricate challenges inherent in multi-agent coordination. How can the integration of unseen tools further enhance LLM
adaptability in diverse scenarios? What critical dimensions
of the model output influence effective tool selection, and
C. AI Agents in Automated Scientific Discovery how can they be optimized for greater interpretability? More-
Liu et al. [223] introduce a large-scale benchmark for over, how might this methodology be extended to enable
evaluating the capability of large language models (LLMs) in more robust autonomous decision-making in AI agents facing
generating high-quality scientific research hypotheses. It tack- increasingly complex reasoning challenges? Notably, these
les this gap by focusing on three pivotal sub-tasks: inspiration questions also underscore key challenges such as managing
retrieval, hypothesis composition, and hypothesis ranking. The a huge tool pool, ensuring efficient tool selection, enhancing
authors have developed an automated framework that extracts model interpretability, and integrating autonomous AI agents
key components from scientific papers, including research capable of dynamic, independent operation.
39

E. Empowering LLM Agents with Integrated Search via Rein- Researchers have developed various training and infer-
forcement Learning ence strategies to cultivate these reasoning abilities, including
ReSearch [225] represents a significant step toward endow- inference-time scaling, pure reinforcement learning (for ex-
ing LLM-based agents with the ability to decide autonomously ample, DeepSeek-R1-Zero), supervised fine-tuning combined
when and how to consult external knowledge sources, seam- with reinforcement learning, and distillation-based fine-tuning.
lessly weaving search operations into their reasoning chains Adaptations of Qwen-32B and Llama-based architectures
via reinforcement learning. By framing search as an action- show that a balanced combination of these methods yields
able tokenized operation rather than a separate retrieval step emergent reasoning behaviors while reducing overthinking and
ReSearch trains models like Qwen2.5 through a reward signal verbosity.
that emphasizes final-answer accuracy and adherence to a We also provided a unified comparison of state-of-the-art
structured think/search/result format. This paradigm eliminates benchmarks from 2019 to 2025, together with a taxonomy of
the need for painstakingly annotated reasoning traces and approximately 60 evaluation suites. Our analysis encompasses
yields strong multi-hop question–answering performance and training frameworks, including mixture-of-experts, retrieval-
cross-domain generalization. Yet, several challenges remain augmented generation, and reinforcement learning, as well as
for deploying such agents in the wild: how to scale the ap- architectural enhancements that drive performance improve-
proach to richer, real-time toolsets (e.g., calculators, databases, ments. In addition, we reviewed AI agent frameworks devel-
code execution environments) without blowing up action oped between 2023 and 2025 and illustrated their applications
spaces; how to design more nuanced reward functions that in domains including materials science, biomedical research,
capture partial credit for intermediate reasoning or mitigate synthetic data generation, and financial forecasting.
reward hacking; how to ensure robustness and interpretability Despite these successes, several challenges remain. Key
when agents autonomously interleave reasoning and tool use; open problems include automating multi-step reasoning with-
and how to balance exploration of novel tool sequences against out human oversight, balancing structured guidance with
exploitation of known effective patterns. Addressing these model flexibility, and integrating long-context retrieval at
questions will be crucial for realizing truly versatile, trust- scale. Future research must address these challenges to unlock
worthy LLM agents capable of complex, multi-step problem- the full potential of autonomous AI agents.
solving. Looking ahead, we anticipate an increasing focus on
domain- and application-specific optimization. Early exam-
ples, such as DeepSeek-R1-Distill, Sky-T1, and TinyZero,
F. Vulnerabilities of AI Agents Protocols demonstrate how specialized reasoning systems can achieve
MCP protocol standardizes how AI applications provide a favorable trade-off between performance and computational
context to LLMs. The MCP protocol faces critical vulnera- cost. Continued innovation in training methodologies, model
bilities in Agent AI communications due to its fundamentally architectures, and benchmarking will drive the next generation
decentralized design [216]. Without a central authority over- of high-efficiency, high-accuracy AI reasoning systems.
seeing security, disparate implementation practices can lead to
uneven defenses, making it easier for attackers to exploit weak
links. In particular, the absence of a standardized authentica- R EFERENCES
tion mechanism across different nodes hinders reliable identity
verification, thereby increasing the risk of unauthorized access [1] A. Jaech, A. Kalai, A. Lerer, A. Richardson, A. El-Kishky, A. Low,
A. Helyar, A. Madry, A. Beutel, A. Carney et al., “Openai o1 system
and potential data breaches. Moreover, deficiencies in robust card,” arXiv preprint arXiv:2412.16720, 2024.
logging and debugging tools further complicate the timely [2] J. Xu, Z. Guo, J. He, H. Hu, T. He, S. Bai, K. Chen,
detection of anomalies and errors, which is vital for preventing J. Wang, Y. Fan, K. Dang, B. Zhang, X. Wang, Y. Chu, and
and mitigating attacks. Additionally, the complexity inherent in J. Lin, “Qwen2.5-omni technical report,” 2025. [Online]. Available:
https://arxiv.org/abs/2503.20215
managing multi-step, distributed workflows can lead to state [3] D. Guo, D. Yang, H. Zhang, J. Song, R. Zhang, R. Xu, Q. Zhu, S. Ma,
inconsistencies and operational glitches, amplifying the po- P. Wang, X. Bi et al., “Deepseek-r1: Incentivizing reasoning capability
tential impact of a security compromise across interconnected in llms via reinforcement learning,” arXiv preprint arXiv:2501.12948,
2025.
systems. [4] A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle,
A. Letman, A. Mathur, A. Schelten, A. Vaughan et al., “The llama 3
herd of models,” arXiv preprint arXiv:2407.21783, 2024.
VI. C ONCLUSION [5] X. Huang, W. Liu, X. Chen, X. Wang, H. Wang, D. Lian, Y. Wang,
R. Tang, and E. Chen, “Understanding the planning of llm agents: A
In this paper, we have surveyed recent advances in the survey,” arXiv preprint arXiv:2402.02716, 2024.
reasoning capabilities of large language models (LLMs) and [6] J. Gu, X. Jiang, Z. Shi, H. Tan, X. Zhai, C. Xu, W. Li, Y. Shen,
autonomous AI agents and highlighted the benefits of multi- S. Ma, H. Liu et al., “A survey on llm-as-a-judge,” arXiv preprint
arXiv:2411.15594, 2024.
step, intermediate processing for solving complex tasks in
[7] Q. Wang, R. Ding, Z. Chen, W. Wu, S. Wang, P. Xie, and F. Zhao,
advanced mathematics, code generation, and logical reasoning. “Vidorag: Visual document retrieval-augmented generation via dynamic
By exposing their internal reasoning through intermediate iterative reasoning agents,” arXiv preprint arXiv:2502.18017, 2025.
steps, models such as DeepSeek-R1, OpenAI o1 and o3, and [8] Y. Li, Y. Li, X. Wang, Y. Jiang, Z. Zhang, X. Zheng, H. Wang, H.-T.
Zheng, P. Xie, P. S. Yu et al., “Benchmarking multimodal retrieval
GPT-4o achieve greater accuracy and reliability compared to augmented generation with dynamic vqa dataset and self-adaptive
direct-response approaches. planning agent,” arXiv preprint arXiv:2411.02937, 2024.
40

[9] H. Q. Yu and F. McQuade, “Rag-kg-il: A multi-agent hybrid framework [32] A. Zhou, K. Yan, M. Shlapentokh-Rothman, H. Wang, and Y.-X. Wang,
for reducing hallucinations and enhancing llm reasoning through rag “Language agent tree search unifies reasoning acting and planning in
and incremental knowledge graph learning integration,” arXiv preprint language models,” arXiv preprint arXiv:2310.04406, 2023.
arXiv:2503.13514, 2025. [33] H. Su and Others, “Learn-by-interact: A data-centric framework
[10] S. Ateia and U. Kruschwitz, “Bioragent: A retrieval-augmented gener- for self-adaptive agents in realistic environments,” arXiv preprint
ation system for showcasing generative query expansion and domain- arXiv:2501.10893, 2025.
specific search for scientific q&a,” arXiv preprint arXiv:2412.12358, [34] M. Hu, P. Zhao, C. Xu, Q. Sun, J. Lou, Q. Lin, P. Luo, and S. Ra-
2024. jmohan, “Agentgen: Enhancing planning abilities for large language
[11] H. Shimadzu, T. Utsuro, and D. Kitayama, “Retrieval-augmented model based agent via environment and task generation,” arXiv preprint
simulacra: Generative agents for up-to-date and knowledge-adaptive arXiv:2408.00764, 2024.
simulations,” arXiv preprint arXiv:2503.14620, 2025. [35] A. Zeng, M. Liu, R. Lu, B. Wang, X. Liu, Y. Dong, and J. Tang,
[12] G. Xiong, Q. Jin, X. Wang, Y. Fang, H. Liu, Y. Yang, F. Chen, Z. Song, “Agenttuning: Enabling generalized agent abilities for llms,” arXiv
D. Wang, M. Zhang et al., “Rag-gym: Optimizing reasoning and search preprint arXiv:2310.12823, 2023.
agents with process supervision,” arXiv preprint arXiv:2502.13957, [36] C. Gulcehre, T. L. Paine, S. Srinivasan, K. Konyushkova, L. Weerts,
2025. A. Sharma, A. Siddhant, A. Ahern, M. Wang, C. Gu et al., “Re-
[13] M. A. Ferrag, N. Tihanyi, and M. Debbah, “Reasoning beyond limits: inforced self-training (rest) for language modeling,” arXiv preprint
Advances and open problems for llms,” 2025. [Online]. Available: arXiv:2308.08998, 2023.
https://arxiv.org/abs/2503.22732
[37] R. Aksitov, S. Miryoosefi, Z. Li, D. Li, S. Babayan, K. Kopparapu,
[14] J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman,
Z. Fisher, R. Guo, S. Prakash, P. Srinivasan et al., “Rest meets react:
D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat et al., “Gpt-4
Self-improvement for multi-step reasoning llm agent,” arXiv preprint
technical report,” arXiv preprint arXiv:2303.08774, 2023.
arXiv:2312.10003, 2023.
[15] G. Team, R. Anil, S. Borgeaud, J.-B. Alayrac, J. Yu, R. Soricut,
J. Schalkwyk, A. M. Dai, A. Hauth, K. Millican et al., “Gem- [38] T. Guo, X. Chen, Y. Wang, R. Chang, S. Pei, N. V. Chawla, O. Wiest,
ini: a family of highly capable multimodal models,” arXiv preprint and X. Zhang, “Large language model based multi-agents: A survey
arXiv:2312.11805, 2023. of progress and challenges,” arXiv preprint arXiv:2402.01680, 2024.
[16] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, [39] A. Goldie, A. Mirhoseini, H. Zhou, I. Cai, and C. D. Manning,
N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale et al., “Llama “Synthetic data generation & multi-step rl for reasoning & tool use,”
2: Open foundation and fine-tuned chat models,” arXiv preprint arXiv preprint arXiv:2504.04736, 2025.
arXiv:2307.09288, 2023. [40] S. Hong, X. Zheng, J. Chen, Y. Cheng, J. Wang, C. Zhang,
[17] S. Schmidgall, Y. Su, Z. Wang, X. Sun, J. Wu, X. Yu, J. Liu, Z. Liu, Z. Wang, S. K. S. Yau, Z. Lin, L. Zhou et al., “Metagpt: Meta
and E. Barsoum, “Agent laboratory: Using llm agents as research programming for multi-agent collaborative framework,” arXiv preprint
assistants,” arXiv preprint arXiv:2501.04227, 2025. arXiv:2308.00352, vol. 3, no. 4, p. 6, 2023.
[18] A. Ajith, M. Xia, A. Chevalier, T. Goyal, D. Chen, and T. Gao, [41] C. Qian, X. Cong, C. Yang, W. Chen, Y. Su, J. Xu, Z. Liu, and
“Litsearch: A retrieval benchmark for scientific literature search,” arXiv M. Sun, “Communicative agents for software development,” arXiv
preprint arXiv:2407.18940, 2024. preprint arXiv:2307.07924, vol. 6, no. 3, 2023.
[19] H. Kang and C. Xiong, “Researcharena: Benchmarking llms’ ability [42] Z. Mandi, S. Jain, and S. Song, “Roco: Dialectic multi-robot col-
to collect and organize information as research agents,” arXiv preprint laboration with large language models,” in 2024 IEEE International
arXiv:2406.10291, 2024. Conference on Robotics and Automation (ICRA). IEEE, 2024, pp.
[20] J. Baek, S. K. Jauhar, S. Cucerzan, and S. J. Hwang, “Researchagent: 286–299.
Iterative research idea generation over scientific literature with large [43] H. Zhang, W. Du, J. Shan, Q. Zhou, Y. Du, J. B. Tenenbaum, T. Shu,
language models,” arXiv preprint arXiv:2404.07738, 2024. and C. Gan, “Building cooperative embodied agents modularly with
[21] M. Gridach, J. Nanavati, K. Z. E. Abidine, L. Mendes, and C. Mack, large language models,” arXiv preprint arXiv:2307.02485, 2023.
“Agentic ai for scientific discovery: A survey of progress, challenges, [44] J. S. Park, J. O’Brien, C. J. Cai, M. R. Morris, P. Liang, and
and future directions,” arXiv preprint arXiv:2503.08979, 2025. M. S. Bernstein, “Generative agents: Interactive simulacra of human
[22] Y. Kim, C. Park, H. Jeong, Y. S. Chan, X. Xu, D. McDuff, H. Lee, behavior,” in Proceedings of the 36th annual acm symposium on user
M. Ghassemi, C. Breazeal, H. Park et al., “Mdagents: An adaptive interface software and technology, 2023, pp. 1–22.
collaboration of llms for medical decision-making,” Advances in Neural [45] B. Xiao, Z. Yin, and Z. Shan, “Simulating public administration crisis:
Information Processing Systems, vol. 37, pp. 79 410–79 452, 2024. A novel generative agent-based simulation system to lower technology
[23] S. Mukherjee, P. Gamble, M. S. Ausin, N. Kant, K. Aggarwal, barriers in social science research,” arXiv preprint arXiv:2311.06957,
N. Manjunath, D. Datta, Z. Liu, J. Ding, S. Busacca et al., “Polaris: 2023.
A safety-focused llm constellation architecture for healthcare,” arXiv [46] S. Wang, C. Liu, Z. Zheng, S. Qi, S. Chen, Q. Yang, A. Zhao,
preprint arXiv:2403.13313, 2024. C. Wang, S. Song, and G. Huang, “Avalon’s game of thoughts: Battle
[24] T. Yuan, Z. He, L. Dong, Y. Wang, R. Zhao, T. Xia, L. Xu, B. Zhou, against deception through recursive contemplation,” arXiv preprint
F. Li, Z. Zhang et al., “R-judge: Benchmarking safety risk awareness arXiv:2310.01320, 2023.
for llm agents,” arXiv preprint arXiv:2401.10019, 2024. [47] Y. Wang, W. Zhong, Y. Huang, E. Shi, M. Yang, J. Chen, H. Li, Y. Ma,
[25] W. YAN, J. HU, H. ZENG, M. LIU, and W. LIANG, “The application Q. Wang, and Z. Zheng, “Agents in software engineering: Survey,
of large language models in primary healthcare services and the landscape, and vision,” arXiv preprint arXiv:2409.09030, 2024.
challenges,” Chinese General Practice, vol. 28, no. 01, p. 1, 2025.
[48] H. Jin, L. Huang, H. Cai, J. Yan, B. Li, and H. Chen, “From llms to llm-
[26] H. Yu, J. Zhou, L. Li, S. Chen, J. Gallifant, A. Shi, X. Li, W. Hua,
based agents for software engineering: A survey of current, challenges
M. Jin, G. Chen et al., “Aipatient: Simulating patients with ehrs and llm
and future,” arXiv preprint arXiv:2408.02479, 2024.
powered agentic workflow,” arXiv preprint arXiv:2409.18924, 2024.
[27] S. Schmidgall, R. Ziaei, C. Harris, E. Reis, J. Jopling, and M. Moor, [49] A. Singh, A. Ehtesham, S. Kumar, and T. T. Khoei, “Agentic retrieval-
“Agentclinic: a multimodal agent benchmark to evaluate ai in simulated augmented generation: A survey on agentic rag,” arXiv preprint
clinical environments,” arXiv preprint arXiv:2405.07960, 2024. arXiv:2501.09136, 2025.
[28] W. Wang, Z. Ma, Z. Wang, C. Wu, W. Chen, X. Li, and Y. Yuan, “A [50] A. Yehudai, L. Eden, A. Li, G. Uziel, Y. Zhao, R. Bar-Haim, A. Cohan,
survey of llm-based agents in medicine: How far are we from baymax?” and M. Shmueli-Scheuer, “Survey on evaluation of llm-based agents,”
arXiv preprint arXiv:2502.11211, 2025. 2025. [Online]. Available: https://arxiv.org/abs/2503.16416
[29] X. Wang, Y. Chen, L. Yuan, Y. Zhang, Y. Li, H. Peng, and H. Ji, “Exe- [51] Q. Chen, L. Qin, J. Liu, D. Peng, J. Guan, P. Wang, M. Hu, Y. Zhou,
cutable code actions elicit better llm agents,” in Forty-first International T. Gao, and W. Che, “Towards reasoning era: A survey of long
Conference on Machine Learning, 2024. chain-of-thought for reasoning large language models,” arXiv preprint
[30] N. Shinn, F. Cassano, A. Gopinath, K. Narasimhan, and S. Yao, arXiv:2503.09567, 2025.
“Reflexion: Language agents with verbal reinforcement learning,” [52] B. Yan, X. Zhang, L. Zhang, L. Zhang, Z. Zhou, D. Miao, and C. Li,
Advances in Neural Information Processing Systems, vol. 36, pp. 8634– “Beyond self-talk: A communication-centric survey of llm-based multi-
8652, 2023. agent systems,” arXiv preprint arXiv:2502.14321, 2025.
[31] S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y. Cao, [53] X. Feng, L. Dou, E. Li, Q. Wang, H. Wang, Y. Guo, C. Ma, and
“React: Synergizing reasoning and acting in language models,” in L. Kong, “A survey on large language model-based social agents in
International Conference on Learning Representations (ICLR), 2023. game-theoretic scenarios,” arXiv preprint arXiv:2412.03920, 2024.
41

[54] C. Zhang, S. He, J. Qian, B. Li, L. Li, S. Qin, Y. Kang, M. Ma, G. Liu, [76] M. Kazemi, B. Fatemi, H. Bansal, J. Palowitch, C. Anastasiou, S. V.
Q. Lin et al., “Large language model-brained gui agents: A survey,” Mehta, L. K. Jain, V. Aglietti, D. Jindal, P. Chen et al., “Big-bench
arXiv preprint arXiv:2411.18279, 2024. extra hard,” arXiv preprint arXiv:2502.19187, 2025.
[55] Y. Li, H. Wen, W. Wang, X. Li, Y. Yuan, G. Liu, J. Liu, W. Xu, [77] K. Zhu, H. Du, Z. Hong, X. Yang, S. Guo, Z. Wang, Z. Wang, C. Qian,
X. Wang, Y. Sun et al., “Personal llm agents: Insights and sur- X. Tang, H. Ji et al., “Multiagentbench: Evaluating the collaboration
vey about the capability, efficiency and security,” arXiv preprint and competition of llm agents,” arXiv preprint arXiv:2503.01935, 2025.
arXiv:2401.05459, 2024. [78] G. Mialon, C. Fourrier, T. Wolf, Y. LeCun, and T. Scialom, “Gaia:
[56] M. C. Ramos, C. J. Collison, and A. D. White, “A review of large a benchmark for general ai assistants,” in The Twelfth International
language models and autonomous agents in chemistry,” Chemical Conference on Learning Representations, 2023.
Science, 2025. [79] R. A. Dubniczky, K. Z. Horvát, T. Bisztray, M. A. Ferrag, L. C.
[57] C. J. Wang, D. Lee, C. Menghini, J. Mols, J. Doughty, A. Khoja, Cordeiro, and N. Tihanyi, “Castle: Benchmarking dataset for static
J. Lynch, S. Hendryx, S. Yue, and D. Hendrycks, “Enigmaeval: A code analyzers and llms towards cwe detection,” arXiv preprint
benchmark of long multimodal reasoning challenges,” arXiv preprint arXiv:2503.09433, 2025.
arXiv:2502.08859, 2025. [80] J. Yao, K. Wang, R. Hsieh, H. Zhou, T. Zou, Z. Cheng, Z. Wang, and
[58] D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and P. Viswanath, “Spin-bench: How well do llms plan strategically and
J. Steinhardt, “Measuring massive multitask language understanding,” reason socially?” arXiv preprint arXiv:2503.12349, 2025.
arXiv preprint arXiv:2009.03300, 2020. [81] S. Yao, N. Shinn, P. Razavi, and K. Narasimhan, “A benchmark
[59] L. Zhong, Z. Du, X. Zhang, H. Hu, and J. Tang, “Complexfuncbench: for tool-agent-user interaction in real-world domains,” arXiv preprint
Exploring multi-step and constrained function calling under long- arXiv:2406.12045, 2024.
context scenario,” arXiv preprint arXiv:2501.10132, 2025. [82] D. Dua, Y. Wang, P. Dasigi, G. Stanovsky, S. Singh, and M. Gardner,
[60] L. Phan, A. Gatti, Z. Han, N. Li, J. Hu, H. Zhang, S. Shi, M. Choi, “Drop: A reading comprehension benchmark requiring discrete reason-
A. Agrawal, A. Chopra et al., “Humanity’s last exam,” arXiv preprint ing over paragraphs,” arXiv preprint arXiv:1903.00161, 2019.
arXiv:2501.14249, 2025. [83] D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang,
[61] DeepMind, “Facts & grounding: A new benchmark for evaluating D. Song, and J. Steinhardt, “Measuring mathematical problem solving
the factuality of large language models,” 2023, accessed: 2025- with the math dataset,” arXiv preprint arXiv:2103.03874, 2021.
02-03. [Online]. Available: https://deepmind.google/discover/blog/ [84] M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. D. O. Pinto, J. Kaplan,
facts-grounding-a-new-benchmark-for-evaluating-the-factuality-of-\ H. Edwards, Y. Burda, N. Joseph, G. Brockman et al., “Evaluating large
large-language-models/ language models trained on code,” arXiv preprint arXiv:2107.03374,
[62] C. Zheng, Z. Zhang, B. Zhang, R. Lin, K. Lu, B. Yu, D. Liu, J. Zhou, 2021.
and J. Lin, “Processbench: Identifying process errors in mathematical
[85] F. Shi, M. Suzgun, M. Freitag, X. Wang, S. Srivats, S. Vosoughi, H. W.
reasoning,” arXiv preprint arXiv:2412.06559, 2024.
Chung, Y. Tay, S. Ruder, D. Zhou et al., “Language models are multi-
[63] L. Ouyang, Y. Qu, H. Zhou, J. Zhu, R. Zhang, Q. Lin, B. Wang, lingual chain-of-thought reasoners,” arXiv preprint arXiv:2210.03057,
Z. Zhao, M. Jiang, X. Zhao et al., “Omnidocbench: Benchmarking 2022.
diverse pdf document parsing with comprehensive annotations,” arXiv
[86] V. Samuel, H. P. Zou, Y. Zhou, S. Chaudhari, A. Kalyan, T. Rajpurohit,
preprint arXiv:2412.07626, 2024.
A. Deshpande, K. Narasimhan, and V. Murahari, “Personagym: Evalu-
[64] M. Zhuge, C. Zhao, D. Ashley, W. Wang, D. Khizbullin, Y. Xiong,
ating persona agents and llms,” arXiv preprint arXiv:2407.18416, 2024.
Z. Liu, E. Chang, R. Krishnamoorthi, Y. Tian et al., “Agent-as-a-judge:
Evaluate agents with agents,” arXiv preprint arXiv:2410.10934, 2024. [87] C. Ye, Z. Hu, Y. Deng, Z. Huang, M. D. Ma, Y. Zhu, and W. Wang,
“Mirai: Evaluating llm agents for event forecasting,” arXiv preprint
[65] S. Tan, S. Zhuang, K. Montgomery, W. Y. Tang, A. Cuadron, C. Wang,
arXiv:2407.01231, 2024.
R. A. Popa, and I. Stoica, “Judgebench: A benchmark for evaluating
llm-based judges,” arXiv preprint arXiv:2410.12784, 2024. [88] H. Trivedi, T. Khot, M. Hartmann, R. Manku, V. Dong, E. Li, S. Gupta,
[66] OpenAI, “Introducing simpleqa,” 2024, accessed: 2025-02-03. A. Sabharwal, and N. Balasubramanian, “Appworld: A controllable
[Online]. Available: https://openai.com/index/introducing-simpleqa/ world of apps and people for benchmarking interactive coding agents,”
arXiv preprint arXiv:2407.18901, 2024.
[67] HuggingFaceFW, “Fine tasks,” 2024, accessed: 2025-02-03.
[Online]. Available: https://huggingface.co/spaces/HuggingFaceFW/ [89] X. Liu, T. Zhang, Y. Gu, I. L. Iong, Y. Xu, X. Song, S. Zhang, H. Lai,
blogpost-fine-tasks X. Liu, H. Zhao et al., “Visualagentbench: Towards large multimodal
[68] S. Krishna, K. Krishna, A. Mohananey, S. Schwarcz, A. Stam- models as visual foundation agents,” arXiv preprint arXiv:2408.06327,
bler, S. Upadhyay, and M. Faruqui, “Fact, fetch, and reason: A 2024.
unified evaluation of retrieval-augmented generation,” arXiv preprint [90] Z. Chen, S. Chen, Y. Ning, Q. Zhang, B. Wang, B. Yu, Y. Li, Z. Liao,
arXiv:2409.12941, 2024. C. Wei, Z. Lu et al., “Scienceagentbench: Toward rigorous assessment
[69] Hugging Face, “Dabstep,” 2025, accessed: 2025-02-03. [Online]. of language agents for data-driven scientific discovery,” arXiv preprint
Available: https://huggingface.co/blog/dabstep arXiv:2410.05080, 2024.
[70] H. Mao, C. C.-J. Ji, F. Yan, T. Zhang, and S. G. Patil, “Bfcl v2 [91] Z. Zhang, S. Cui, Y. Lu, J. Zhou, J. Yang, H. Wang, and M. Huang,
live,” https://gorilla.cs.berkeley.edu/blogs/12 bfcl v2 live.html, 2024, “Agent-safetybench: Evaluating the safety of llm agents,” arXiv
accessed: February 16, 2025. preprint arXiv:2412.14470, 2024.
[71] S. Miserendino, M. Wang, T. Patwardhan, and J. Heidecke, [92] B. P. Majumder, H. Surana, D. Agarwal, B. D. Mishra, A. Meena,
“Swe-lancer: Can frontier llms earn $1 million from real world A. Prakhar, T. Vora, T. Khot, A. Sabharwal, and P. Clark, “Discov-
freelance software engineering?” 2025. [Online]. Available: https: erybench: Towards data-driven discovery with large language models,”
//arxiv.org/abs/2502.12115 arXiv preprint arXiv:2407.01725, 2024.
[72] X. Yang, K. Sun, H. Xin, Y. Sun, N. Bhalla, X. Chen, S. Choudhary, [93] K. Gu, R. Shang, R. Jiang, K. Kuang, R.-J. Lin, D. Lyu, Y. Mao, Y. Pan,
R. D. Gui, Z. W. Jiang, Z. Jiang et al., “Crag–comprehensive rag T. Wu, J. Yu et al., “Blade: Benchmarking language model agents for
benchmark,” arXiv preprint arXiv:2406.04744, 2024. data-driven science,” arXiv preprint arXiv:2408.09667, 2024.
[73] M. Kouremetis, M. Dotter, A. Byrne, D. Martin, E. Michalak, [94] J. Liu, W. Wang, Z. Ma, G. Huang, Y. SU, K.-J. Chang, W. Chen, H. Li,
G. Russo, M. Threet, and G. Zarrella, “Occult: Evaluating large L. Shen, and M. Lyu, “Medchain: Bridging the gap between llm agents
language models for offensive cyber operation capabilities,” 2025. and clinical practice through interactive sequential benchmarking,”
[Online]. Available: https://arxiv.org/abs/2502.15797 arXiv preprint arXiv:2412.01605, 2024.
[74] N. Tihanyi, T. Bisztray, R. A. Dubniczky, R. Toth, B. Borsos, B. Cherif, [95] Q. Long, Z. Li, R. Gong, Y. N. Wu, D. Terzopoulos, and X. Gao,
R. Jain, L. Muzsai, M. A. Ferrag, R. Marinelli et al., “Dynamic “Teamcraft: A benchmark for multi-modal multi-agent systems in
intelligence assessment: Benchmarking llms on the road to agi with minecraft,” arXiv preprint arXiv:2412.05255, 2024.
a focus on model confidence,” in 2024 IEEE International Conference [96] M. Andriushchenko, A. Souly, M. Dziemian, D. Duenas, M. Lin,
on Big Data (BigData). IEEE, 2024, pp. 3313–3321. J. Wang, D. Hendrycks, A. Zou, Z. Kolter, M. Fredrikson et al.,
[75] N. Tihanyi, M. A. Ferrag, R. Jain, T. Bisztray, and M. Debbah, “Agentharm: A benchmark for measuring harmfulness of llm agents,”
“Cybermetric: a benchmark dataset based on retrieval-augmented gen- arXiv preprint arXiv:2410.09024, 2024.
eration for evaluating llms in cybersecurity knowledge,” in 2024 IEEE [97] H. Li, J. Chen, J. Yang, Q. Ai, W. Jia, Y. Liu, K. Lin, Y. Wu, G. Yuan,
International Conference on Cyber Security and Resilience (CSR). Y. Hu et al., “Legalagentbench: Evaluating llm agents in legal domain,”
IEEE, 2024, pp. 296–302. arXiv preprint arXiv:2412.17259, 2024.
42

[98] D. Rein, B. L. Hou, A. C. Stickland, J. Petty, R. Y. Pang, J. Dirani, [120] M. S. Rashid, C. Bock, Y. Zhuang, A. Buccholz, T. Esler, S. Valentin,
J. Michael, and S. R. Bowman, “Gpqa: A graduate-level google-proof L. Franceschi, M. Wistuba, P. T. Sivaprasad, W. J. Kim, A. Deoras,
q&a benchmark,” in First Conference on Language Modeling, 2024. G. Zappella, and L. Callot, “Swe-polybench: A multi-language bench-
[99] X. Tang, D. Shao, J. Sohn, J. Chen, J. Zhang, J. Xiang, F. Wu, Y. Zhao, mark for repository level evaluation of coding agents,” 2025.
C. Wu, W. Shi et al., “Medagentsbench: Benchmarking thinking models [121] D. Zan, Z. Huang, W. Liu, H. Chen, L. Zhang, S. Xin, L. Chen, Q. Liu,
and agent frameworks for complex medical reasoning,” arXiv preprint X. Zhong, A. Li et al., “Multi-swe-bench: A multilingual benchmark
arXiv:2503.07459, 2025. for issue resolving,” arXiv preprint arXiv:2504.02605, 2025.
[100] Z. Cheng, Y. Tu, R. Li, S. Dai, J. Hu, S. Hu, J. Li, Y. Shi, T. Yu, [122] A. Srivastava, A. Rastogi, A. Rao, A. A. M. Shoeb, A. Abid, A. Fisch,
W. Chen et al., “Embodiedeval: Evaluate multimodal llms as embodied A. R. Brown, A. Santoro, A. Gupta, A. Garriga-Alonso et al., “Beyond
agents,” arXiv preprint arXiv:2501.11858, 2025. the imitation game: Quantifying and extrapolating the capabilities of
[101] Z. Huang, Z. Wang, S. Xia, X. Li, H. Zou, R. Xu, R.-Z. Fan, language models,” arXiv preprint arXiv:2206.04615, 2022.
L. Ye, E. Chern, Y. Ye et al., “Olympicarena: Benchmarking multi- [123] M. Suzgun, N. Scales, N. Schärli, S. Gehrmann, Y. Tay, H. W. Chung,
discipline cognitive reasoning for superintelligent ai,” Advances in A. Chowdhery, Q. V. Le, E. H. Chi, D. Zhou et al., “Challenging
Neural Information Processing Systems, vol. 37, pp. 19 209–19 253, big-bench tasks and whether chain-of-thought can solve them,” arXiv
2024. preprint arXiv:2210.09261, 2022.
[102] Y. Xiang, H. Yan, S. Ouyang, L. Gui, and Y. He, “Scireplicate- [124] Langchain agents tutorial. Accessed: February 23, 2025. [Online].
bench: Benchmarking llms in agent-driven algorithmic reproduction Available: https://python.langchain.com/docs/tutorials/agents/
from research papers,” arXiv preprint arXiv:2504.00255, 2025. [125] Building a basic agent. Accessed: February 23, 2025. [Online].
[103] S. Fish, J. Shephard, M. Li, R. I. Shorrer, and Y. A. Gonczarowski, Available: https://docs.llamaindex.ai/en/stable/understanding/agent/
“Econevals: Benchmarks and litmus tests for llm agents in unknown [126] Crewai. Accessed: February 23, 2025. [Online]. Available: https:
environments,” arXiv preprint arXiv:2503.18825, 2025. //www.crewai.com/
[104] Y. Y. Sung, H. Kim, and D. Zhang, “Verila: A human-centered eval- [127] OpenAI, “swarm,” 2024, accessed: 2025-02-03. [Online]. Available:
uation framework for interpretable verification of llm agent failures,” https://github.com/openai/swarm/tree/main
arXiv preprint arXiv:2503.12651, 2025. [128] S. Hu, M. Ouyang, D. Gao, and M. Z. Shou, “The dawn of gui agent:
[105] Y. Yang, B. Huang, S. Qi, C. Feng, H. Hu, Y. Zhu, J. Hu, H. Zhao, A preliminary case study with claude 3.5 computer use,” arXiv preprint
Z. He, X. Liu et al., “Who’s the mvp? a game-theoretic evaluation arXiv:2411.10323, 2024.
benchmark for modular attribution in llm agents,” arXiv preprint [129] J. Wu, J. Zhu, and Y. Liu, “Agentic reasoning: Reasoning llms with
arXiv:2502.00510, 2025. tools for the deep research,” arXiv preprint arXiv:2502.04644, 2025.
[106] Z. Li, S. Huang, J. Wang, N. Zhang, A. Antoniades, W. Hua, K. Zhu, [130] P. Lu, B. Chen, S. Liu, R. Thapa, J. Boen, and J. Zou, “Octotools: An
S. Zeng, W. Y. Wang, and X. Yan, “Agentorca: A dual-system frame- agentic framework with extensible tools for complex reasoning,” arXiv
work to evaluate language agents on operational routine and constraint preprint arXiv:2502.11271, 2025.
adherence,” arXiv preprint arXiv:2503.08669, 2025.
[131] OpenAI, “Agents sdk,” https://platform.openai.com/docs/guides/
[107] K. Liu, Y. Pan, J. Li, D. He, Y. Xiang, Y. Du, and T. Gao, “Projecteval:
agents-sdk, accessed: March 18, 2025.
A benchmark for programming agents automated evaluation on project-
[132] C. Wang, X. Hu, Y. Zhang, X. Chen, P. Du, Y. Mao, R. Wang,
level code generation,” arXiv preprint arXiv:2503.07010, 2025.
Y. Li, Y. Wu, H. Yang et al., “Starwhisper telescope: Agent-based
[108] D. Gautam, S. Garg, J. Jang, N. Sundaresan, and R. Z. Moghad-
observation assistant system to approach ai astrophysicist,” arXiv
dam, “Refactorbench: Evaluating stateful reasoning in language agents
preprint arXiv:2412.06412, 2024.
through code,” arXiv preprint arXiv:2503.07832, 2025.
[109] Y. Song, K. Thai, C. M. Pham, Y. Chang, M. Nadaf, and M. Iyyer, [133] H. Zhang, Y. Song, Z. Hou, S. Miret, and B. Liu, “Honeycomb: A
“Bearcubs: A benchmark for computer-using web agents,” arXiv flexible llm-based agent system for materials science,” arXiv preprint
preprint arXiv:2503.07919, 2025. arXiv:2409.00135, 2024.
[110] G. Gonzalez-Pumariega, L. S. Yean, N. Sunkara, and S. Choudhury, [134] Z. Wang, Q. Jin, C.-H. Wei, S. Tian, P.-T. Lai, Q. Zhu, C.-P. Day,
“Robotouille: An asynchronous planning benchmark for llm agents,” C. Ross, and Z. Lu, “Geneagent: self-verification language agent for
arXiv preprint arXiv:2502.05227, 2025. gene set knowledge discovery using domain databases,” arXiv preprint
[111] W. Tang, Y. Zhou, E. Xu, K. Cheng, M. Li, and L. Xiao, “Dsg- arXiv:2405.16205, 2024.
bench: A diverse strategic game benchmark for evaluating llm-based [135] M. J. Buehler, “Preflexor: Preference-based recursive language mod-
agents in complex decision-making environments,” arXiv preprint eling for exploratory optimization of reasoning and agentic thinking,”
arXiv:2503.06047, 2025. arXiv preprint arXiv:2410.12375, 2024.
[112] M. Ku, T. Chong, J. Leung, K. Shah, A. Yu, and W. Chen, “The- [136] X. Liang, J. Yang, Y. Wang, C. Tang, Z. Zheng, S. Niu, S. Song,
oremexplainagent: Towards multimodal explanations for llm theorem H. Wang, B. Tang, F. Xiong et al., “Surveyx: Academic survey au-
understanding,” arXiv preprint arXiv:2502.19400, 2025. tomation via large language models,” arXiv preprint arXiv:2502.14776,
[113] J. Yan, Y. Luo, and Y. Zhang, “Refutebench 2.0–agentic benchmark for 2025.
dynamic evaluation of llm responses to refutation instruction,” arXiv [137] L. Li, W. Xu, J. Guo, R. Zhao, X. Li, Y. Yuan, B. Zhang,
preprint arXiv:2502.18308, 2025. Y. Jiang, Y. Xin, R. Dang et al., “Chain of ideas: Revolutionizing
[114] D. Nathani, L. Madaan, N. Roberts, N. Bashlykov, A. Menon, research via novel idea development with llm agents,” arXiv preprint
V. Moens, A. Budhiraja, D. Magka, V. Vorotilov, G. Chaurasia et al., arXiv:2410.13185, 2024.
“Mlgym: A new framework and benchmark for advancing ai research [138] A. Mitra, L. Del Corro, G. Zheng, S. Mahajan, D. Rouhana, A. Co-
agents,” arXiv preprint arXiv:2502.14499, 2025. das, Y. Lu, W.-g. Chen, O. Vrousgos, C. Rosset et al., “Agentin-
[115] D. Zhang, S. Zhoubian, M. Cai, F. Li, L. Yang, W. Wang, T. Dong, struct: Toward generative teaching with agentic flows,” arXiv preprint
Z. Hu, J. Tang, and Y. Yue, “Datascibench: An llm agent benchmark arXiv:2407.03502, 2024.
for data science,” arXiv preprint arXiv:2502.13897, 2025. [139] X. Tang, T. Hu, M. Ye, Y. Shao, X. Yin, S. Ouyang, W. Zhou,
[116] R. Yang, H. Chen, J. Zhang, M. Zhao, C. Qian, K. Wang, Q. Wang, P. Lu, Z. Zhang, Y. Zhao et al., “Chemagent: Self-updating library in
T. V. Koripella, M. Movahedi, M. Li et al., “Embodiedbench: Compre- large language models improves chemical reasoning,” arXiv preprint
hensive benchmarking multi-modal large language models for vision- arXiv:2501.06590, 2025.
driven embodied agents,” arXiv preprint arXiv:2502.09560, 2025. [140] B. Liu, J. Zhang, F. Lin, X. Jia, and M. Peng, “\textit {One Size doesn’t
[117] J. Wei, Z. Sun, S. Papay, S. McKinney, J. Han, I. Fulford, H. W. Chung, Fit All}: A personalized conversational tutoring agent for mathematics
A. Tachard, Passos, W. Fedus, and A. Glaese, “Browsecomp: A simple instruction,” arXiv preprint arXiv:2502.12633, 2025.
yet challenging benchmark for browsing agents,” https://cdn.openai. [141] Z. Zhang, X. Bo, C. Ma, R. Li, X. Chen, Q. Dai, J. Zhu, Z. Dong,
com/pdf/5e10f4ab-d6f7-442e-9508-59515c65e35d/browsecomp.pdf, and J.-R. Wen, “A survey on the memory mechanism of large language
2025, accessed: 2025-04-13. model based agents,” arXiv preprint arXiv:2404.13501, 2024.
[118] A. Backlund and L. Petersson, “Vending-bench: A benchmark [142] P. Zhao, H. Zhang, Q. Yu, Z. Wang, Y. Geng, F. Fu, L. Yang, W. Zhang,
for long-term coherence of autonomous agents,” arXiv preprint J. Jiang, and B. Cui, “Retrieval-augmented generation for ai-generated
arXiv:2502.15840, 2025. content: A survey,” arXiv preprint arXiv:2402.19473, 2024.
[119] J. S. Chan, N. Chowdhury, O. Jaffe, J. Aung, D. Sherburn, E. Mays, [143] Y. Liu, S. K. Lo, Q. Lu, L. Zhu, D. Zhao, X. Xu, S. Harrer, and
G. Starace, K. Liu, L. Maksin, T. Patwardhan et al., “Mle-bench: J. Whittle, “Agent design pattern catalogue: A collection of architec-
Evaluating machine learning agents on machine learning engineering,” tural patterns for foundation model based agents,” Journal of Systems
arXiv preprint arXiv:2410.07095, 2025. and Software, vol. 220, p. 112278, 2025.
43

[144] “How to design an agent for production,” ac- [167] S. Hong, Y. Lin, B. Liu, B. Liu, B. Wu, C. Zhang, C. Wei, D. Li,
cessed: 2025-04-14. [Online]. Available: https://blog.langchain.dev/ J. Chen, J. Zhang et al., “Data interpreter: An llm agent for data
how-to-design-an-agent-for-production/ science,” arXiv preprint arXiv:2402.18679, 2024.
[145] Q. Sun, K. Cheng, Z. Ding, C. Jin, Y. Wang, F. Xu, Z. Wu, [168] J. Gottweis, W.-H. Weng, A. Daryin, T. Tu, A. Palepu, P. Sirkovic,
C. Jia, L. Chen, Z. Liu et al., “Os-genesis: Automating gui agent A. Myaskovsky, F. Weissenberger, K. Rong, R. Tanno et al., “Towards
trajectory construction via reverse task synthesis,” arXiv preprint an ai co-scientist,” arXiv preprint arXiv:2502.18864, 2025.
arXiv:2412.19723, 2024. [169] W. Dong, “The ann arbor architecture for agent-oriented programming,”
[146] J. Chen, C. Gui, A. Gao, K. Ji, X. Wang, X. Wan, and B. Wang, “Cod, arXiv preprint arXiv:2502.09903, 2025.
towards an interpretable medical agent using chain of diagnosis,” arXiv [170] N. Jain, J. Singh, M. Shetty, L. Zheng, K. Sen, and I. Stoica, “R2e-gym:
preprint arXiv:2407.13301, 2024. Procedural environments and hybrid verifiers for scaling open-weights
[147] Y. Zhou, P. Zhang, M. Song, A. Zheng, Y. Lu, Z. Liu, Y. Chen, and swe agents,” arXiv preprint arXiv:2504.07164, 2025.
Z. Xi, “Zodiac: A cardiologist-level llm framework for multi-agent [171] J. Wang, Y. Dai, Y. Zhang, Z. Ma, W. Li, and J. Chai, “Training turn-
diagnostics,” arXiv preprint arXiv:2410.02026, 2024. by-turn verifiers for dialogue tutoring agents: The curious case of llms
[148] Z. Wang, J. Wu, C. H. Low, and Y. Jin, “Medagent-pro: Towards as your coding tutors,” arXiv preprint arXiv:2502.13311, 2025.
multi-modal evidence-based medical diagnosis via reasoning agentic [172] H.-Y. Chen, C.-P. Huang, and J.-M. Yao, “Verbal process supervision
workflow,” arXiv preprint arXiv:2503.18968, 2025. elicits better coding agents,” arXiv preprint arXiv:2503.18494, 2025.
[149] I. Steenstra, F. Nouraei, and T. W. Bickmore, “Scaffolding empathy: [173] V. Aggarwal, O. Kamal, A. Japesh, Z. Jin, and B. Schölkopf, “Dars:
Training counselors with simulated patients and utterance-level perfor- Dynamic action re-sampling to enhance coding agent performance by
mance visualizations,” arXiv preprint arXiv:2502.18673, 2025. adaptive tree traversal,” arXiv preprint arXiv:2503.14269, 2025.
[150] J. Feng, Q. Zheng, C. Wu, Z. Zhao, Y. Zhang, Y. Wang, and W. Xie, [174] Z. Chen, X. Tang, G. Deng, F. Wu, J. Wu, Z. Jiang, V. Prasanna,
“Mˆ 3builder: A multi-agent system for automated machine learning A. Cohan, and X. Wang, “Locagent: Graph-guided llm agents for code
in medical imaging,” arXiv preprint arXiv:2502.20301, 2025. localization,” arXiv preprint arXiv:2503.09089, 2025.
[151] D. Rose, C.-C. Hung, M. Lepri, I. Alqassem, K. Gashteovski, [175] A. Gholamzadeh Khoee, S. Wang, Y. Yu, R. Feldt, and
and C. Lawrence, “Meddxagent: A unified modular agent frame- D. Parthasarathy, “Gatelens: A reasoning-enhanced llm agent for
work for explainable automatic differential diagnosis,” arXiv preprint automotive software release analytics,” arXiv e-prints, pp. arXiv–
arXiv:2502.19175, 2025. 2503, 2025.
[152] F. Ghezloo, M. S. Seyfioglu, R. Soraki, W. O. Ikezogwo, B. Li, [176] R. Hu, C. Peng, X. Wang, and C. Gao, “An llm-based agent for reliable
T. Vivekanandan, J. G. Elmore, R. Krishna, and L. Shapiro, “Pathfinder: docker environment configuration,” arXiv preprint arXiv:2502.13681,
A multi-modal multi-agent system for medical diagnostic decision- 2025.
making applied to histopathology,” arXiv preprint arXiv:2502.08916, [177] Y. Lu, B. Yao, H. Gu, J. Huang, J. Wang, L. Li, J. Gesi, Q. He, T. J.-
2025. J. Li, and D. Wang, “Uxagent: An llm agent-based usability testing
[153] M. A. Abbasi, F. S. Mirnezami, and H. Naderi, “Hamraz: A culture- framework for web design,” arXiv preprint arXiv:2502.12561, 2025.
based persian conversation dataset for person-centered therapy using
[178] J. Pan, X. Wang, G. Neubig, N. Jaitly, H. Ji, A. Suhr, and Y. Zhang,
llm agents,” arXiv preprint arXiv:2502.05982, 2025.
“Training software engineering agents and verifiers with swe-gym,”
[154] Y. Yang, P. Achananuparp, H. Huang, J. Jiang, K. P. Leng, N. G. Lim,
arXiv preprint arXiv:2412.21139, 2024.
C. T. S. Ern, and E.-p. Lim, “Cami: A counselor agent supporting mo-
[179] J. Yang, W. Zhang, J. Yang, Y. Miao, S. Quan, Z. Wu, Q. Peng, L. Yang,
tivational interviewing through state inference and topic exploration,”
T. Liu, Z. Cui et al., “Multi-agent collaboration for multilingual code
arXiv preprint arXiv:2502.02807, 2025.
instruction tuning,” arXiv preprint arXiv:2502.07487, 2025.
[155] A. Xu, D. Yang, R. Li, J. Zhu, M. Tan, M. Yang, W. Qiu, M. Ma,
H. Wu, B. Li et al., “Autocbt: An autonomous multi-agent framework [180] X. Guo, X. Wang, Y. Chen, S. Li, C. Han, M. Li, and H. Ji,
for cognitive behavioral therapy in psychological counseling,” arXiv “Syncmind: Measuring agent out-of-sync recovery in collaborative
preprint arXiv:2501.09426, 2025. software engineering,” arXiv preprint arXiv:2502.06994, 2025.
[156] J. Lee, K. Lim, Y.-C. Jung, and B.-H. Kim, “Psyche: A multi-faceted [181] M. A. Islam, M. E. Ali, and M. R. Parvez, “Codesim: Multi-agent code
patient simulation framework for evaluation of psychiatric assessment generation and problem solving through simulation-driven planning and
conversational agents,” arXiv preprint arXiv:2501.01594, 2025. debugging,” arXiv preprint arXiv:2502.05664, 2025.
[157] Y. Zhang, X. Yang, X. Li, S. Yu, Y. Luan, S. Feng, D. Wang, [182] X. Wan, H. Deng, K. Zou, and S. Xu, “Enhancing the efficiency and
and Y. Zhang, “Psydraw: A multi-agent multimodal system for accuracy of underlying asset reviews in structured finance: The appli-
mental health screening in left-behind children,” arXiv preprint cation of multi-agent framework,” arXiv preprint arXiv:2405.04294,
arXiv:2412.14769, 2024. 2024.
[158] Z. Du, L. Zheng, R. Hu, Y. Xu, X. Li, Y. Sun, W. Chen, J. Wu, [183] Y. Yang, Y. Zhang, M. Wu, K. Zhang, Y. Zhang, H. Yu, Y. Hu, and
H. Cai, and H. Ying, “Llms can simulate standardized patients via B. Wang, “Twinmarket: A scalable behavioral and socialsimulation for
agent coevolution,” arXiv preprint arXiv:2412.11716, 2024. financial markets,” arXiv preprint arXiv:2502.01506, 2025.
[159] R. Wasenmüller, K. Hilbert, and C. Benzmüller, “Script-based dialog [184] Y. Yu, Z. Yao, H. Li, Z. Deng, Y. Jiang, Y. Cao, Z. Chen, J. Suchow,
policy planning for llm-powered conversational agents: A basic archi- Z. Cui, R. Liu et al., “Fincon: A synthesized llm multi-agent system
tecture for an” ai therapist”,” arXiv preprint arXiv:2412.15242, 2024. with conceptual verbal reinforcement for enhanced financial decision
[160] R. Averly, F. N. Baker, and X. Ning, “Liddia: Language-based intelli- making,” Advances in Neural Information Processing Systems, vol. 37,
gent drug discovery agent,” arXiv preprint arXiv:2502.13959, 2025. pp. 137 010–137 045, 2024.
[161] X. Wang, Y. Zhang, X. Zhang, L. Yu, X. Lin, J. Jiang, B. Ma, and [185] R. Y. Lin, S. Ojha, K. Cai, and M. F. Chen, “Strategic collusion of
K. Yu, “Patentagent: Intelligent agent for automated pharmaceutical llm agents: Market division in multi-commodity competitions,” arXiv
patent analysis,” arXiv preprint arXiv:2410.21312, 2024. preprint arXiv:2410.00031, 2024.
[162] Y. Inoue, T. Song, and T. Fu, “Drugagent: Explainable drug repurposing [186] S. Fatemi and Y. Hu, “Enhancing financial question answering with
agent with large language model-based reasoning,” arXiv preprint a multi-agent reflection framework,” in Proceedings of the 5th ACM
arXiv:2408.13378, 2024. International Conference on AI in Finance, 2024, pp. 530–537.
[163] Z. Chen, Z. Peng, X. Liang, C. Wang, P. Liang, L. Zeng, [187] X. Han, N. Wang, S. Che, H. Yang, K. Zhang, and S. X. Xu, “Enhanc-
M. Ju, and Y. Yuan, “Map: Evaluation and multi-agent enhancement ing investment analysis: Optimizing ai-agent collaboration in financial
of large language models for inpatient pathways,” arXiv preprint research,” in Proceedings of the 5th ACM International Conference on
arXiv:2503.13205, 2025. AI in Finance, 2024, pp. 538–546.
[164] T. Yun, E. Yang, M. Safdari, J. H. Lee, V. V. Kumar, S. S. Mahdavi, [188] S. Han, C. Zhou, Y. Shen, T. Sun, Y. Zhou, X. Wang, Z. Yang, J. Zhang,
J. Amar, D. Peyton, R. Aharony, A. Michaelides et al., “Sleepless and H. Li, “Finsphere: A conversational stock analysis agent equipped
nights, sugary days: Creating synthetic users with health conditions for with quantitative tools based on real-time database,” arXiv preprint
realistic coaching agent interactions,” arXiv preprint arXiv:2502.13135, arXiv:2501.12399, 2025.
2025. [189] G. Fatouros, K. Metaxas, J. Soldatos, and M. Karathanassis, “Mar-
[165] X. Lin, S. Ma, J. Shan, X. Zhang, S. X. Hu, T. Guo, S. Z. Li, and ketsenseai 2.0: Enhancing stock analysis through llm agents,” arXiv
K. Yu, “Biokgbench: A knowledge graph checking benchmark of ai preprint arXiv:2502.00415, 2025.
agent for biomedical science,” arXiv preprint arXiv:2407.00466, 2024. [190] I. Okpala, A. Golgoon, and A. R. Kannan, “Agentic ai systems applied
[166] S. Schmidgall and M. Moor, “Agentrxiv: Towards collaborative au- to tasks in financial services: Modeling and model risk management
tonomous research,” arXiv preprint arXiv:2503.18102, 2025. crews,” arXiv preprint arXiv:2502.05439, 2025.
44

[191] J. Zeng, H. Liu, Z. Dai, X. Tang, C. Luo, S. Varshney, Z. Li, [214] “Beeai now has multiple agents, and a standardized way for
and Q. He, “Cite before you speak: Enhancing context-response them to talk,” accessed: 2025-04-14. [Online]. Available: https:
grounding in e-commerce conversational llm-agents,” arXiv preprint //research.ibm.com/blog/multiagent-bee-ai
arXiv:2503.04830, 2025. [215] “A2A: A New Era of Agent Interoperability,” accessed: 2025-
[192] H. Cho, D. Kim, S. Yang, C. Lee, H. Lee, and J. Choo, “Building 04-14. [Online]. Available: https://developers.googleblog.com/en/
resource-constrained language agents: A korean case study on chemical a2a-a-new-era-of-agent-interoperability/
toxicity information,” arXiv preprint arXiv:2503.17753, 2025. [216] X. Hou, Y. Zhao, S. Wang, and H. Wang, “Model context protocol
[193] S. Kumbhar, V. Mishra, K. Coutinho, D. Handa, A. Iquebal, and (mcp): Landscape, security threats, and future research directions,”
C. Baral, “Hypothesis generation for materials discovery and design arXiv preprint arXiv:2503.23278, 2025.
using goal-driven and constraint-guided llm agents,” arXiv preprint [217] W. Yuan, J. Yu, S. Jiang, K. Padthe, Y. Li, D. Wang, I. Kulikov, K. Cho,
arXiv:2501.13299, 2025. Y. Tian, J. E. Weston et al., “Naturalreasoning: Reasoning in the wild
[194] B. Lei, Y. Zhang, S. Zuo, A. Payani, and C. Ding, “Macm: Utilizing with 2.8 m challenging questions,” arXiv preprint arXiv:2502.13124,
a multi-agent system for condition mining in solving complex mathe- 2025.
matical problems,” arXiv preprint arXiv:2404.04735, 2024. [218] G. Penedo, H. Kydlı́ček, V. Sabolčec, B. Messmer, N. Foroutan,
[195] W. Xie, D. Liu, H. Yan, W. Wu, and Z. Liu, “Mathlearner: A large M. Jaggi, L. von Werra, and T. Wolf, “Fineweb2: A sparkling
language model agent framework for learning to solve mathematical update with 1000s of languages,” Dec. 2024. [Online]. Available:
problems,” arXiv preprint arXiv:2408.01779, 2024. https://huggingface.co/datasets/HuggingFaceFW/fineweb-2
[196] G. Lee, S. Park, J. Park, A. Chung, S. Park, Y. Park, B. Kim, and [219] Argilla, “Magpie ultra v0.1 [dataset],” https://huggingface.co/datasets/
M.-g. Cho, “Expanding search space with diverse prompting agents: argilla/magpie-ultra-v0.1, 2024, accessed: February 16, 2025.
An efficient sampling approach for llm mathematical reasoning,” arXiv [220] C. Costello, S. Guo, A. Goldie, and A. Mirhoseini, “Think, prune,
preprint arXiv:2410.09780, 2024. train, improve: Scaling reasoning without scaling models,” 2025.
[197] Y. Deng and P. Mineiro, “Flow-dpo: Improving llm mathemati- [Online]. Available: https://arxiv.org/abs/2504.18116
cal reasoning through online multi-agent learning,” arXiv preprint [221] V. Xiang, C. Snell, K. Gandhi, A. Albalak, A. Singh, C. Blagden,
arXiv:2410.22304, 2024. D. Phung, R. Rafailov, N. Lile, D. Mahan et al., “Towards system 2
[198] V. Li, Y. Fu, T. Knappe, K. Han, and K. Zhu, “Automating mathemati- reasoning in llms: Learning how to think with meta chain-of-though,”
cal proof generation using large language model agents and knowledge arXiv preprint arXiv:2501.04682, 2025.
graphs,” arXiv preprint arXiv:2503.11657, 2025. [222] M. Z. Pan, M. Cemri, L. A. Agrawal, S. Yang, B. Chopra, R. Tiwari,
[199] R. Wang, R. Pan, Y. Li, J. Zhang, Y. Jia, S. Diao, R. Pi, K. Keutzer, A. Parameswaran, K. Ramchandran, D. Klein et al., “Why
J. Hu, and T. Zhang, “Ma-lot: Multi-agent lean-based long chain-of- do multiagent systems fail?” in ICLR 2025 Workshop on Building Trust
thought reasoning enhances formal theorem proving,” arXiv preprint in Language Models and Applications.
arXiv:2503.03205, 2025. [223] Y. Liu, Z. Yang, T. Xie, J. Ni, B. Gao, Y. Li, S. Tang, W. Ouyang,
[200] M. Yue, W. Lyu, W. Mifdal, J. Suh, Y. Zhang, and Z. Yao, “Mathvc: E. Cambria, and D. Zhou, “Researchbench: Benchmarking llms in
An llm-simulated multi-character virtual classroom for mathematics scientific discovery via inspiration-based task decomposition,” 2025.
education,” arXiv preprint arXiv:2404.06711, 2024. [Online]. Available: https://arxiv.org/abs/2503.21248
[201] B. Liu, J. Zhang, F. Lin, X. Jia, and M. Peng, “One size doesn’t fit all: A [224] M. Wu, T. Zhu, H. Han, X. Zhang, W. Shao, and W. Chen, “Chain-
personalized conversational tutoring agent for mathematics instruction,” of-tools: Utilizing massive unseen tools in the cot reasoning of frozen
2025. [Online]. Available: https://arxiv.org/abs/2502.12633 language models,” arXiv preprint arXiv:2503.16779, 2025.
[202] T. Ma, J. Du, W. Huang, W. Wang, L. Xie, X. Zhong, and J. T. Zhou, [225] M. Chen, T. Li, H. Sun, Y. Zhou, C. Zhu, F. Yang, Z. Zhou, W. Chen,
“Llm knows geometry better than algebra: Numerical understanding of H. Wang, J. Z. Pan et al., “Learning to reason with search for llms via
llm-based agents in a trading arena,” arXiv preprint arXiv:2502.17967, reinforcement learning,” arXiv preprint arXiv:2503.19470, 2025.
2025.
[203] B. Yu, T. Shen, H. Na, L. Chen, and D. Li, “Mineagent: Towards
remote-sensing mineral exploration with multimodal large language
models,” arXiv preprint arXiv:2412.17339, 2024.
[204] H. Ning, Z. Li, T. Akinboyewa, and M. N. Lessani, “An autonomous gis
agent framework for geospatial data retrieval,” International Journal of
Digital Earth, vol. 18, no. 1, p. 2458688, 2025.
[205] Z. Xu, L. Wang, J. Wang, Z. Li, S. Shi, X. Yang, Y. Wang, B. Hu, J. Yu,
and M. Zhang, “Filmagent: A multi-agent framework for end-to-end
film automation in virtual 3d spaces,” arXiv preprint arXiv:2501.12909,
2025.
[206] J. Wang, Z. Du, Y. Zhao, B. Yuan, K. Wang, J. Liang, Y. Zhao, Y. Lu,
G. Li, J. Gao et al., “Aesopagent: Agent-driven evolutionary system on
story-to-video production,” arXiv preprint arXiv:2403.07952, 2024.
[207] S. Han, L. Chen, L.-M. Lin, Z. Xu, and K. Yu, “Ibsen: Director-
actor agent collaboration for controllable and interactive drama script
generation,” arXiv preprint arXiv:2407.01093, 2024.
[208] A. Maronikolakis, A. P. Ramallo, W. Cheng, and T. Kober, “What
should i wear to a party in a greek taverna? evaluation for conversa-
tional agents in the fashion domain,” arXiv preprint arXiv:2408.08907,
2024.
[209] Q. Deng, Q. Yang, R. Yuan, Y. Huang, Y. Wang, X. Liu, Z. Tian,
J. Pan, G. Zhang, H. Lin et al., “Composerx: Multi-agent symbolic
music composition with llms,” arXiv preprint arXiv:2404.18081, 2024.
[210] D. Yu, K. Song, P. Lu, T. He, X. Tan, W. Ye, S. Zhang, and J. Bian,
“Musicagent: An ai agent for music understanding and generation with
large language models,” arXiv preprint arXiv:2310.11954, 2023.
[211] R. Zhang and S. Eger, “Llm-based multi-agent poetry generation
in non-cooperative environments,” arXiv preprint arXiv:2409.03659,
2024.
[212] H.-H. Liu and Y.-W. Liu, “Agent-driven large language models for
mandarin lyric generation,” in 2024 27th Conference of the Orien-
tal COCOSDA International Committee for the Co-ordination and
Standardisation of Speech Databases and Assessment Techniques (O-
COCOSDA). IEEE, 2024, pp. 1–6.
[213] “Introduction to mcp,” accessed: 2025-04-14. [Online]. Available:
https://modelcontextprotocol.io/introduction

Literature Survey 1
No ratings yet
Literature Survey 1
6 pages
Agent Laboratoray 1736610469
No ratings yet
Agent Laboratoray 1736610469
56 pages
AI Agents: LLM vs. Traditional
No ratings yet
AI Agents: LLM vs. Traditional
15 pages
Large Language Model Agent: A Survey On Methodology, Applications and Challenges
No ratings yet
Large Language Model Agent: A Survey On Methodology, Applications and Challenges
26 pages
A Survey On Large Language Model Based Autonomous Agents
No ratings yet
A Survey On Large Language Model Based Autonomous Agents
42 pages
Autoagents: A Framework For Automatic Agent Generation
No ratings yet
Autoagents: A Framework For Automatic Agent Generation
9 pages
Draft 2
No ratings yet
Draft 2
19 pages
REX: Enhancing AI Agents with Rapid Exploration
No ratings yet
REX: Enhancing AI Agents with Rapid Exploration
16 pages
Agents
No ratings yet
Agents
42 pages
Survey LLM-Agents 2025
No ratings yet
Survey LLM-Agents 2025
44 pages
A New Framework and Benchmark For Advancing AI Research Agents
No ratings yet
A New Framework and Benchmark For Advancing AI Research Agents
35 pages
Automated Interview References
No ratings yet
Automated Interview References
16 pages
Empowering Biomedical Discovery With AI Agent
No ratings yet
Empowering Biomedical Discovery With AI Agent
45 pages
Handler - 2023 - A Taxonomy For Autonomous LLM-Powered Multi-Agent Architectures
No ratings yet
Handler - 2023 - A Taxonomy For Autonomous LLM-Powered Multi-Agent Architectures
15 pages
A Survey On Large Language Model Based Autonomous Agents
No ratings yet
A Survey On Large Language Model Based Autonomous Agents
32 pages
Large Language Models (LLMS) : Survey, Technical Frameworks, and Future Challenges
No ratings yet
Large Language Models (LLMS) : Survey, Technical Frameworks, and Future Challenges
51 pages
LLM Agents Making Agent Tools: Georg Wölflein, Dyke Ferber, Daniel Truhn, Ognjen Arandjelovi C, Jakob N. Kather
No ratings yet
LLM Agents Making Agent Tools: Georg Wölflein, Dyke Ferber, Daniel Truhn, Ognjen Arandjelovi C, Jakob N. Kather
38 pages
Application Driven Valuen Alignment in Agentic AI Systems
No ratings yet
Application Driven Valuen Alignment in Agentic AI Systems
38 pages
Advances and Challenges in Foundation Agents
No ratings yet
Advances and Challenges in Foundation Agents
264 pages
Autoagents: A Framework For Automatic Agent Generation
No ratings yet
Autoagents: A Framework For Automatic Agent Generation
30 pages
NeurIPS 2023 Hugginggpt Solving Ai Tasks With Chatgpt and Its Friends in Hugging Face Paper Conference
No ratings yet
NeurIPS 2023 Hugginggpt Solving Ai Tasks With Chatgpt and Its Friends in Hugging Face Paper Conference
27 pages
Agentic Reasoning: Reasoning Llms With Tools For The Deep Research
No ratings yet
Agentic Reasoning: Reasoning Llms With Tools For The Deep Research
8 pages
A AI S D: A S P, C, F D - : Gentic FOR Cientific Iscovery Urvey OF Rogress Hallenges AND Uture Irec Tions
No ratings yet
A AI S D: A S P, C, F D - : Gentic FOR Cientific Iscovery Urvey OF Rogress Hallenges AND Uture Irec Tions
14 pages
Hugginggpt: Solving Ai Tasks With Chatgpt and Its Friends in Hugging Face
No ratings yet
Hugginggpt: Solving Ai Tasks With Chatgpt and Its Friends in Hugging Face
27 pages
Agent Survey
No ratings yet
Agent Survey
35 pages
Question-Answer System On Medical Domain With LLMS Using Various Fine-Tuning Methods
No ratings yet
Question-Answer System On Medical Domain With LLMS Using Various Fine-Tuning Methods
15 pages
A Survey of LLM Based On Autonomous Agents
No ratings yet
A Survey of LLM Based On Autonomous Agents
35 pages
LLMs: A Researcher's Guide
No ratings yet
LLMs: A Researcher's Guide
46 pages
How AI Agents Can Help Supercharge Language Models - A Handbook For Developers
No ratings yet
How AI Agents Can Help Supercharge Language Models - A Handbook For Developers
127 pages
Function Calling at Edge
No ratings yet
Function Calling at Edge
9 pages
Survey On Evaluation of LLM-based Agents
No ratings yet
Survey On Evaluation of LLM-based Agents
20 pages
Xagents: A Framework For Interpretable Rule-Based Multi-Agents Cooperation
No ratings yet
Xagents: A Framework For Interpretable Rule-Based Multi-Agents Cooperation
9 pages
Empowering Llms in Task-Oriented Dialogues: A Domain-Independent Multi-Agent Framework and Fine-Tuning Strategy
No ratings yet
Empowering Llms in Task-Oriented Dialogues: A Domain-Independent Multi-Agent Framework and Fine-Tuning Strategy
14 pages
Comparing LLMs Using A Unified Performance Ranking System
No ratings yet
Comparing LLMs Using A Unified Performance Ranking System
13 pages
LLM Based Agents Synopsis
No ratings yet
LLM Based Agents Synopsis
9 pages
LLMs Agents Guide
No ratings yet
LLMs Agents Guide
11 pages
Agentic AI
No ratings yet
Agentic AI
6 pages
LLM AI Agents: Task Planning & Tools
No ratings yet
LLM AI Agents: Task Planning & Tools
36 pages
RAG Pipeline For Domain Specific Applications A Case Study in Disseminating Dementia Care Practices
No ratings yet
RAG Pipeline For Domain Specific Applications A Case Study in Disseminating Dementia Care Practices
5 pages
LMM Model
No ratings yet
LMM Model
41 pages
Overview of Large Language Models
No ratings yet
Overview of Large Language Models
47 pages
LLM Model
No ratings yet
LLM Model
43 pages
Agent Q
No ratings yet
Agent Q
22 pages
Survey
No ratings yet
Survey
23 pages
Survey Agent Optimization Arxiv 2503
No ratings yet
Survey Agent Optimization Arxiv 2503
42 pages
LLMs: A Researcher's Guide
No ratings yet
LLMs: A Researcher's Guide
46 pages
Unveiling The Mathematical Reasoning in Deepseek Models: A Comparative Study of Large Language Models
No ratings yet
Unveiling The Mathematical Reasoning in Deepseek Models: A Comparative Study of Large Language Models
27 pages
LLM Based Multi Ageny
No ratings yet
LLM Based Multi Ageny
15 pages
Navin
No ratings yet
Navin
12 pages
LLMs: Generating Novel Research Ideas
No ratings yet
LLMs: Generating Novel Research Ideas
24 pages
Paper 3
No ratings yet
Paper 3
48 pages
Hugginggpt: Solving Ai Tasks With Chatgpt and Its Friends in Hugging Face
No ratings yet
Hugginggpt: Solving Ai Tasks With Chatgpt and Its Friends in Hugging Face
25 pages
Toward A Holistic Performance Evaluation of Large Language Models Across Diverse AI Accelerators
No ratings yet
Toward A Holistic Performance Evaluation of Large Language Models Across Diverse AI Accelerators
10 pages
Fasdfa
No ratings yet
Fasdfa
27 pages
Overview of Large Language Models
No ratings yet
Overview of Large Language Models
46 pages
Google REST
No ratings yet
Google REST
19 pages
Agentrxiv: Towards Collaborative Autonomous Research
No ratings yet
Agentrxiv: Towards Collaborative Autonomous Research
29 pages
Computational Intelligence Overview
No ratings yet
Computational Intelligence Overview
6 pages
HWR Project Paper
No ratings yet
HWR Project Paper
13 pages
Artificial Intelligence in Education and Training Sector .
No ratings yet
Artificial Intelligence in Education and Training Sector .
1 page
Data-Driven Asset Management Strategies
No ratings yet
Data-Driven Asset Management Strategies
31 pages
LLMs and Machine Learning in CFD
No ratings yet
LLMs and Machine Learning in CFD
11 pages
Frontend Developer Intern, Anjali
No ratings yet
Frontend Developer Intern, Anjali
75 pages
NeurIPS 2019 Workshop Guide
No ratings yet
NeurIPS 2019 Workshop Guide
62 pages
AINAT 2025 Research Presentation
No ratings yet
AINAT 2025 Research Presentation
15 pages
AI-Powered Field Service Optimization
No ratings yet
AI-Powered Field Service Optimization
67 pages
Policy On The Use of Genai and The Phd-Trajectory
No ratings yet
Policy On The Use of Genai and The Phd-Trajectory
9 pages
3.4. A Comprehensive Guide To Convolutional Neural Networks - The ELI5 Way - by Sumit Saha - Towards Data Science
No ratings yet
3.4. A Comprehensive Guide To Convolutional Neural Networks - The ELI5 Way - by Sumit Saha - Towards Data Science
17 pages
Btech Robotics Artificial Intelligence Curriculum Syllabus 2024
No ratings yet
Btech Robotics Artificial Intelligence Curriculum Syllabus 2024
210 pages
Cheat Sheet
0% (1)
Cheat Sheet
80 pages
Nitin Aggarwal's AI & ML Expertise on LinkedIn
No ratings yet
Nitin Aggarwal's AI & ML Expertise on LinkedIn
5 pages
Hedgeweek Multi-Managers Whats Next
No ratings yet
Hedgeweek Multi-Managers Whats Next
19 pages
Skylark AI Launches Purpose-Built AI Engine To Revolutionize Private Investment Analysis and Enterprise AI Deployment
No ratings yet
Skylark AI Launches Purpose-Built AI Engine To Revolutionize Private Investment Analysis and Enterprise AI Deployment
4 pages
Urban AI: Governance & Innovation
No ratings yet
Urban AI: Governance & Innovation
42 pages
HCI Chapter 1
No ratings yet
HCI Chapter 1
23 pages
Predictive Maintenance with AI
100% (1)
Predictive Maintenance with AI
12 pages
402B Deep Learning
100% (1)
402B Deep Learning
82 pages
(Lua) Advanced Aerial AI Documentation, Version 4.2 - Pastebin
No ratings yet
(Lua) Advanced Aerial AI Documentation, Version 4.2 - Pastebin
5 pages
Week 7 - Current Trends and Emerging Technologies
No ratings yet
Week 7 - Current Trends and Emerging Technologies
13 pages
Unit 3 Notes AI
No ratings yet
Unit 3 Notes AI
45 pages
Are Machines Stealing Our Jobs?: Andrea Gentili, Fabiano Compagnucci, Mauro Gallegati and Enzo Valentini
No ratings yet
Are Machines Stealing Our Jobs?: Andrea Gentili, Fabiano Compagnucci, Mauro Gallegati and Enzo Valentini
21 pages
JSW Digital Transformation: Private and Confidential April 2019
100% (1)
JSW Digital Transformation: Private and Confidential April 2019
33 pages
Plant Information Modelling, Using Artificial Intelligence, For Process Hazard and Risk Analysis Study
No ratings yet
Plant Information Modelling, Using Artificial Intelligence, For Process Hazard and Risk Analysis Study
143 pages
Artificial Intelligence For Business
No ratings yet
Artificial Intelligence For Business
2 pages
欢迎来到essay搜索网站！
100% (2)
欢迎来到essay搜索网站！
6 pages
Immediate Download Forecasting With Artificial Intelligence: Theory and Applications Mohsen Hamoudia Ebooks 2024
100% (2)
Immediate Download Forecasting With Artificial Intelligence: Theory and Applications Mohsen Hamoudia Ebooks 2024
47 pages
Villar 2021
No ratings yet
Villar 2021
16 pages