Function Calling at Edge
In this work, we demonstrate that this is feasible by training smaller models with specialized, high-quality data that does not require recalling generic world knowledge. Our goal is to develop Small Language Models (SLMs) that can be securely and privately deployed at the edge while maintaining the complex reasoning capability to understand natural language queries and orchestrate tools and APIs to accomplish user commands.

To achieve this, we first explore enabling small open-source models to perform accurate function calling, a key component of agentic systems. Off-the-shelf SLMs often lack sophisticated function calling capabilities and require fine-tuning. Next, we discuss systematically curating high-quality function calling datasets to train these SLMs, using a specialized Mac assistant agent as our primary application. We demonstrate that fine-tuning the models on this curated dataset can enable SLMs to exceed GPT-4-Turbo's function calling performance. Finally, we enhance the inference efficiency of these fine-tuned models using a novel Tool RAG method and quantization, allowing for efficient edge deployment with real-time responses.

2 Related Work

2.1 Function Calling LLMs

The sophisticated reasoning capabilities of recent LLMs have enabled them to call functions (i.e., tools): the LLM determines which of the user-provided functions to invoke, along with the associated arguments. This allows LLMs to use external functions (e.g., calculators or search engines) to answer user queries more accurately than by responding directly. A pioneering work in this area is Toolformer (Schick et al., 2024), which has inspired various tool-calling frameworks (Ruan et al., 2023; Shen et al., 2024; Liang et al., 2024). ReAct (Yao et al., 2022) introduced a reasoning-and-action process that improved LLMs' interaction with external environments and has become a backbone for different open-source frameworks (Liu, 2022; Langchain). More recently, Gorilla (Patil et al., 2023) and ToolLLM (Qin et al., 2023) have demonstrated that an open-source LLM can be fine-tuned to obtain function-calling capabilities in diverse real-world use cases. One notable work is Octopus (Chen et al., 2024), which introduces on-device LLMs that invoke software APIs. TinyAgent pushes this boundary by enabling efficient inference via parallel function calling (Kim et al., 2023) as well as a novel tool retrieval method, similar to (Moon et al., 2024). Furthermore, our method does not require any architectural changes, making it compatible with a wider range of open-source models.

2.2 Dataset Synthesis

To address the shortage of fine-tuning data, a popular approach has emerged: using LLMs to synthesize new training datapoints (Deng et al., 2023; Prasad et al., 2023; Fu et al., 2023; Dai et al., 2023; Ubani et al., 2023; Fang et al., 2023; Liu et al., 2023; Yu et al., 2023; Kumar et al., 2020; Yoo et al., 2021; Wang et al., 2022; Lee et al., 2024b). While these techniques produce strong results, they often require generating large amounts of training data. Recent work has shown that by filtering these datasets, or by generating smaller but higher-quality datasets, one can achieve similar or better performance (Chen et al., 2023; Cao et al., 2023; Wei et al., 2023; Zhou et al., 2023). TinyAgent builds on these works by constructing a pipeline that systematically generates high-quality, task-specific function-calling datasets, ensuring efficient training and robust performance even with smaller, curated datasets.

2.3 Device Control

Recent advancements in device control have introduced large-scale benchmarks and datasets focused on the Android environment (Rawles et al., 2024b; Zhang et al., 2024b; Rawles et al., 2024a; Lee et al., 2024a), which explore UI-based agents with low-level controls such as typing, scrolling, and tapping. These works are primarily concerned with mobile device interactions in simulated environments, and they do not address the challenge of deploying small language models directly on the device, which is crucial for real-world applications where cloud resources are unavailable or impractical. More recently, UFO (Zhang et al., 2024a) introduced a dual-agent framework that leverages vision and language to enable UI-focused agents to operate within Windows OS applications. However, similar to earlier works, UFO also focuses on low-level control mechanisms and does not address on-device deployment of small language models. TinyAgent pushes this boundary by formulating device control as a high-level function-calling problem instead
of low-level UI actions, utilizing task-specific abstractions that allow for more robust and efficient execution of commands. By running fully locally on macOS, TinyAgent offers a more realistic and practical solution for device control, making it well-suited for real-life scenarios where on-device deployment is necessary.

3 TinyAgent

3.1 Teaching LLMs to do Function Calling

As mentioned above, our main interest is applications where the AI agent translates the user query into a sequence of function calls to complete the tasks. In such applications, the model does not need to write the function definitions itself, since the functions (or APIs) are mostly pre-defined and already available. Therefore, what the model needs to do is to determine (i) which functions to call, (ii) the corresponding input arguments, and (iii) the right order in which to call these functions (i.e., function orchestration), based on the required interdependency across the function calls.

The first question is how to effectively equip SLMs to perform function calling. Large models such as GPT-4 are able to perform function calling, but how can this be achieved with open-source models? LLMCompiler (Kim et al., 2023) is a recent framework that enables this by instructing the LLM to output a function calling plan that includes the set of functions it needs to call, along with the input arguments and their dependencies (see the example in Figure 1). Once this function calling plan is generated, we can parse it and call each function based on the dependencies.

The critical part here is how to teach the model to create this function calling plan with the right syntax and dependencies. The original LLMCompiler (Kim et al., 2023) only considered large models, such as LLaMA-2 70B (Touvron et al., 2023), which have complex reasoning capabilities to create the plan when provided with sufficient instructions in their prompts. Unfortunately, our initial experiments showed that off-the-shelf small models such as TinyLlama-1.1B (Zhang et al., 2024c) (or even the larger Wizard-2-7B model (Vince, 2024)) are not able to output the correct plans when prompted the same way. The errors ranged from using the wrong set of functions and hallucinated function names to wrong dependencies and inconsistent syntax.

This is rather expected, because these small models have been trained on generic datasets and are primarily targeted at achieving good accuracy on general benchmarks, which mostly test the model's world knowledge and general reasoning or basic instruction-following capability. To address this, we explored whether fine-tuning these models on a high-quality dataset specially curated for function calling and planning can improve their accuracy on a targeted task, potentially outperforming larger models. In Section 3.2, we first discuss how we generated such a dataset, and then we discuss the fine-tuning approach in Section 3.3.

3.2 Dataset Generation

As a driving application, we consider a local agentic system for Apple's MacBook that solves the user's day-to-day tasks. In particular, the agent is equipped with 16 different functions that can interact with different applications on the Mac, including:

• Email: Compose a new email or reply to/forward emails
• Contacts: Retrieve phone numbers or email addresses from the contacts database
• SMS: Send text messages to contact(s)
• Calendar: Create calendar events with details such as title, time, attendees, etc.
• Notes: Create, open, or append content to notes in various folders
• Reminders: Set reminders for various activities and tasks
• File management: Open, read, or summarize documents in various file paths
• Zoom meetings: Schedule and organize Zoom meetings

Predefined AppleScripts exist for each of these functions/tools, and all that the model needs to do is to take advantage of the predefined APIs and determine the right function calling plan to accomplish a given task, as in Figure 1. However, as discussed previously, we need a dataset for training and evaluating SLMs, since their off-the-shelf function calling capability is subpar.

Creating handcrafted data with diverse function calling plans is both challenging and not scalable. However, we can curate synthetic data using a powerful LLM like GPT-4-Turbo. Such an
[Figure 1 graphic: the user input "Create a calendar invite with Lutfi and Sid at 2pm tomorrow to discuss TinyAgent" is mapped by the Function Calling Planner to the plan:
$1 = get_email_address("Lutfi")
$2 = get_email_address("Sid")
$3 = create_calendar_event([$1, $2], "4/24 2PM", "TinyAgent Discussion")
$4 = join()]
Figure 1: Overview of the LLMCompiler Function Calling Planner. The Planner understands the user query and generates a sequence of tasks with their inter-dependencies. These tasks are then dispatched by the LLMCompiler framework to accomplish the user command. In this example, Tasks $1 and $2 are fetched together to retrieve the email addresses of Sid and Lutfi independently. After each task is performed, the results are forwarded to Task $3, which creates the calendar event. Before executing Task $3, LLMCompiler replaces the placeholder variables (e.g., the variables $1 and $2 in Task $3) with actual values.
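To make this plan format concrete, the sketch below parses a plan string like the one in Figure 1, resolves the $N placeholder dependencies, and dispatches each call. This is our illustration rather than the LLMCompiler implementation, and the TOOLS registry with its lambda bodies is a hypothetical stand-in for the agent's real APIs.

```python
import re

# Hypothetical tool registry standing in for the agent's predefined APIs.
TOOLS = {
    "get_email_address": lambda name: f"{name.lower()}@example.com",  # placeholder
    "create_calendar_event": lambda attendees, time, title: f"event({attendees}, {time}, {title})",
}

PLAN = '''$1 = get_email_address("Lutfi")
$2 = get_email_address("Sid")
$3 = create_calendar_event([$1, $2], "4/24 2PM", "TinyAgent Discussion")
$4 = join()'''

LINE = re.compile(r"\$(\d+) = (\w+)\((.*)\)")

def run_plan(plan: str) -> dict:
    results = {}
    for line in plan.splitlines():
        task_id, fn, raw_args = LINE.match(line.strip()).groups()
        if fn == "join":  # join() marks the end of the plan
            break
        # Substituting $N placeholders with finished tasks' results is what
        # encodes the inter-task dependencies.
        for dep, value in results.items():
            raw_args = raw_args.replace(f"${dep}", repr(value))
        args = eval(f"[{raw_args}]")  # sketch only; a real parser should not use eval
        results[task_id] = TOOLS[fn](*args)
    return results

print(run_plan(PLAN))
```

Since Tasks $1 and $2 share no dependencies, a real executor can dispatch them in parallel, which is where LLMCompiler's latency savings come from.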
[Figure 2 graphic: a predicted plan calling get_phone_number("Lutfi") instead of the ground truth's get_email_address("Lutfi"), marked ≠ since the resulting DAGs are not isomorphic.]
Figure 2: Graph Isomorphism Success Rate. The model scores a success rate of 1 only if the DAG of its generated plan is isomorphic to the DAG of the ground truth plan, and 0 otherwise. In the example above, for the top case, although the order of the get_email_address calls is different from the ground truth plan (the ground truth plan gets the email address of Lutfi before Sid, while the generated plan gets the email address of Sid before Lutfi), the two DAGs are isomorphic to each other, so the plan receives a success rate of 1. For the bottom case, since the predicted DAG contains a wrong node, corresponding to a wrong function call, the plan receives a success rate of 0.
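The metric itself is straightforward to implement. Below is a minimal sketch using networkx, with plans represented as (task id, function name, dependencies) triples; matching nodes by function name is what makes the metric invariant to the textual order of the calls.

```python
import networkx as nx

def plan_to_dag(calls):
    """calls: list of (task_id, function_name, [ids of tasks it depends on])."""
    dag = nx.DiGraph()
    for task_id, fn, deps in calls:
        dag.add_node(task_id, fn=fn)
        dag.add_edges_from((d, task_id) for d in deps)
    return dag

def success(predicted, ground_truth) -> int:
    # Score 1 only if the DAGs are isomorphic with matching function names.
    same_fn = lambda a, b: a["fn"] == b["fn"]
    return int(nx.is_isomorphic(predicted, ground_truth, node_match=same_fn))

ground_truth = plan_to_dag([(1, "get_email_address", []),       # Lutfi
                            (2, "get_email_address", []),       # Sid
                            (3, "create_calendar_event", [1, 2])])
# Top case of Figure 2: the two lookups are emitted in the opposite order,
# but the resulting DAG is isomorphic, so the plan still scores 1.
reordered = plan_to_dag([(1, "get_email_address", []),          # Sid
                         (2, "get_email_address", []),          # Lutfi
                         (3, "create_calendar_event", [1, 2])])
# Bottom case: a wrong function call (get_phone_number) breaks isomorphism.
wrong = plan_to_dag([(1, "get_phone_number", []),
                     (2, "get_email_address", []),
                     (3, "create_calendar_event", [1, 2])])
print(success(reordered, ground_truth))  # 1
print(success(wrong, ground_truth))      # 0
```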
[Figure 3 graphic: the user input "Create a calendar invite with Lutfi and Sid at 2pm tomorrow" passes through DeBERTa (Layer 1 ... Layer N) and a classification head that outputs a probability for each tool: create_calendar_event, compose_new_email, summarize_pdf, reply_to_email, ..., get_email_address.]
Figure 3: Overview of our Tool RAG scheme. We formulate tool retrieval as a multi-label classification problem.
The user query is given as input to the fine-tuned DeBERTa-v3-small model, which outputs a 16-dimensional vector
indicating tool probabilities. Tools with probabilities higher than 50% are selected, averaging 3.97 tools per query
compared to 6 tools in basic RAG.
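Concretely, the retriever can be implemented as a thin classification head on top of the pretrained encoder. The sketch below shows the inference path under our reading of the figure; the training loop and fine-tuned weights are omitted, and the checkpoint name refers to the public DeBERTa-v3-small release.

```python
import torch
from transformers import AutoModel, AutoTokenizer

class ToolRetriever(torch.nn.Module):
    """Multi-label tool classifier: one sigmoid probability per tool."""
    def __init__(self, num_tools: int = 16):
        super().__init__()
        self.encoder = AutoModel.from_pretrained("microsoft/deberta-v3-small")
        # 768x16 fully connected layer (hidden_size is 768 for this checkpoint).
        self.head = torch.nn.Linear(self.encoder.config.hidden_size, num_tools)

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        cls = hidden[:, 0]                     # [CLS]-position representation
        return torch.sigmoid(self.head(cls))  # per-tool probabilities

tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-small")
model = ToolRetriever()
batch = tokenizer(["Create a calendar invite with Lutfi and Sid at 2pm tomorrow"],
                  return_tensors="pt")
probs = model(batch["input_ids"], batch["attention_mask"])[0]
# Only tools above the 50% threshold get their descriptions put in the prompt.
selected = [i for i, p in enumerate(probs) if p > 0.5]
```

Training such a head reduces to binary cross-entropy against multi-hot labels indicating which tools appear in each ground-truth plan.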
Table 1: Comparison of TinyAgent performance with the fine-tuned DeBERTa Tool RAG against basic RAG and no-RAG settings. For basic RAG, we retrieved the top-3 most relevant tools. For our fine-tuned DeBERTa-v3-small model, we retrieved tools with a probability greater than 50%, which yields ∼3.97 tools per query.
Table 2: Latency, size, and success rate of TinyAgent models before and after quantization. Latency is the end-to-end
latency of the function calling planner, including the prompt processing time and generation.
Model           Weight Precision   Latency (seconds)   Model Size (GB)   Success Rate (%)
GPT-3.5         Unknown            3.2                 Unknown           65.04
GPT-4-Turbo     Unknown            3.9                 Unknown           79.08
TinyAgent-1.1B  16                 3.9                 2.2               80.06
TinyAgent-1.1B  4                  2.9                 0.68              80.35
TinyAgent-7B    16                 19.5                14.5              84.95
TinyAgent-7B    4                  13.1                4.37              85.14
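The 4-bit rows of Table 2 correspond to group-wise quantization with a group size of 32, as described in Section 3.5. The following is our simplified sketch of such a symmetric group-wise scheme, not llama.cpp's actual Q4 storage format.

```python
import torch

def quantize_4bit(w: torch.Tensor, group_size: int = 32):
    groups = w.reshape(-1, group_size)
    # One scale per group, mapping the largest magnitude onto the int4 grid.
    scale = (groups.abs().max(dim=1, keepdim=True).values / 7).clamp(min=1e-8)
    codes = torch.clamp(torch.round(groups / scale), -8, 7).to(torch.int8)
    return codes, scale  # 4-bit codes plus per-group floating-point scales

def dequantize_4bit(codes, scale, shape):
    return (codes.float() * scale).reshape(shape)

w = torch.randn(4096, 4096)        # a weight matrix at full precision
codes, scale = quantize_4bit(w)
w_hat = dequantize_4bit(codes, scale, w.shape)
print((w - w_hat).abs().mean())    # small simulated-quantization error
# Effective storage: 4 bits per weight plus one fp16 scale per 32 weights
# (~4.5 bits/weight) versus 16 bits/weight, i.e., roughly the 4x size
# reduction reported in Table 2.
```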
auxiliary tool is not similar to the user query. For instance, the example shown in Figure 4 requires calling the get_email_address function even though the user query is just asking about creating a calendar invitation.

This can be addressed by treating the problem as a classification of which tools are needed. To that end, we fine-tuned a DeBERTa-v3-small (He et al., 2021) model on the training data to perform a 16-way multi-label classification, as shown in Figure 3. The user query is given as input to this model, and we then pass the CLS token at the end through a simple fully connected layer of size 768×16 to transform it into a 16-dimensional vector (the total number of our tools). The output of this layer is passed through a sigmoid layer to produce the probability of selecting each tool. During inference, we select the tools whose probability is higher than 50% and include their descriptions in the prompt. On average, only 3.97 tools are retrieved with a recall of 0.998, whereas basic RAG requires using the top-6 tools to achieve a tool recall of 0.968.

We evaluated the model performance after incorporating Tool RAG. The results are shown in Table 1, where we report the performance of the simple RAG system along with the fine-tuned DeBERTa approach. As one can see, the DeBERTa-based Tool RAG method achieves almost perfect recall and improves the baseline accuracy, while reducing the prompt size by ∼2×.

3.5 Fast Edge Deployment with Quantization

Deploying models at the edge, such as on consumer MacBooks, can still be challenging even for small models with O(1B) parameters, since loading the model parameters can consume a large portion of the available memory. A solution to these issues is quantization, which allows us to store the model at a reduced bit precision. Quantization not only reduces the storage requirements and model footprint, but also cuts down the time and resources needed to load the model weights into memory, thereby reducing the overall inference latency as well. For more information on quantization, refer to (Gholami et al., 2022).

To deploy the models more efficiently, we quantized them to 4 bits with a group size of 32, which is supported by the llama.cpp framework, using quantization-aware training. As shown in Table 2, the 4-bit models result in 30% better latency, along with a 4× reduction in model size. We also notice a slight accuracy improvement, which is due to the additional fine-tuning with simulated quantization.

4 Putting It All Together

We provide a demo video of the final TinyAgent-1.1B model deployed on a MacBook Pro M3 (https://www.youtube.com/watch?v=0GvaGL9IDpQ), which can be downloaded and tested on Mac
from the link (https://github.com/SqueezeAILab/TinyAgent/raw/main/TinyAgent.zip). It not only runs all of the model inference locally on your computer, but it also allows you to provide commands through audio. We process the audio locally as well, using the Whisper-v3 (Radford et al., 2022) model from OpenAI deployed locally with the whisper.cpp framework. The greatest surprise for us was that the accuracy of the 1.1B model exceeds that of GPT-4-Turbo, while being markedly fast when deployed locally and privately on-device.

5 Conclusions

To summarize, we introduced TinyAgent and showed that it is indeed possible to train a small language model and use it to power a semantic system that processes user queries. In particular, we considered a Siri-like assistant for Mac as a driving application. The key components for enabling it are to (i) teach off-the-shelf SLMs to perform function calling through the LLMCompiler framework, (ii) curate high-quality function calling data for the task at hand, (iii) fine-tune the off-the-shelf model on the generated data, and (iv) enable efficient deployment by optimizing the prompt size through retrieving only the necessary tools based on the user query via Tool RAG, as well as by quantized model deployment to reduce inference resource consumption. After these steps, our final TinyAgent-1.1B and TinyAgent-7B models achieved success rates of 80.06% and 84.95%, respectively, exceeding GPT-4-Turbo's success rate of 79.08% on this task.

6 Ethics Statement

Deploying TinyAgent to operate agentic systems at the edge presents several ethical considerations that are integral to our design and operational philosophy.

Accessibility and Inclusivity: Ensuring that TinyAgent serves all users equitably, including those with disabilities, is a priority. We are committed to designing interfaces that are universally accessible, incorporating features such as voice recognition that can understand diverse speech patterns and text-to-speech technologies that are clear and easily comprehensible. Further, we are exploring adaptive technologies that can adjust to the specific needs of users with varying abilities, ensuring that everyone can benefit from TinyAgent's capabilities without barriers.

Human Oversight: While TinyAgent demonstrates robust capabilities in function calling, the risk of hallucination and erroneous responses by LLMs remains (Zhang et al., 2023). To mitigate this, it is essential to maintain human oversight throughout the operational loop, not just at the endpoint. This means integrating mechanisms for regular checks and balances where humans can review, override, or refine decisions made by TinyAgent. Future iterations of our system will aim to facilitate even more seamless human-agent collaboration to enhance decision accuracy and reliability.

Cultural and Bias Considerations: Synthetic datasets generated using simple or naive prompts often carry inherent biases, such as those related to regional or cultural specificity (Yu et al., 2024). Because task-specific agent systems like TinyAgent rely on synthetic data, their effectiveness and impartiality can be impacted when operating across different demographic landscapes. In response, we integrate diverse cultural data and demographic groups in our data generation processes to mitigate these biases. Our aim is to ensure that the synthetic data fueling TinyAgent is as inclusive and unbiased as possible, supporting a function-calling system that is culturally aware and equitably serves a global user base.

Acknowledgements

We would like to thank Apple for sponsoring this project, as well as Microsoft for support through the Accelerating Foundation Models Research Program. We also thank Sunjin Choi for his insights on the energy costs associated with local and cloud deployment. Our conclusions do not necessarily reflect the position or the policy of our sponsors, and no official endorsement should be inferred.

References

Yihan Cao, Yanbin Kang, Chi Wang, and Lichao Sun. 2023. Instruction mining: When data mining meets large language model finetuning. Preprint, arXiv:2307.06290.

Lichang Chen, Shiyang Li, Jun Yan, Hai Wang, Kalpa Gunaratna, Vikas Yadav, Zheng Tang, Vijay Srinivasan, Tianyi Zhou, Heng Huang, and Hongxia Jin. 2023. Alpagasus: Training a better alpaca with fewer data. Preprint, arXiv:2307.08701.
Wei Chen, Zhiyuan Li, and Mingyuan Ma. 2024. Octopus: On-device language model for function calling of software apis. Preprint, arXiv:2404.01549.

Haixing Dai, Zhengliang Liu, Wenxiong Liao, Xiaoke Huang, Yihan Cao, Zihao Wu, Lin Zhao, Shaochen Xu, Wei Liu, Ninghao Liu, Sheng Li, Dajiang Zhu, Hongmin Cai, Lichao Sun, Quanzheng Li, Dinggang Shen, Tianming Liu, and Xiang Li. 2023. Auggpt: Leveraging chatgpt for text data augmentation. Preprint, arXiv:2302.13007.

Yihe Deng, Weitong Zhang, Zixiang Chen, and Quanquan Gu. 2023. Rephrase and respond: Let large language models ask better questions for themselves. Preprint, arXiv:2311.04205.

Luyang Fang, Gyeong-Geon Lee, and Xiaoming Zhai. 2023. Using gpt-4 to augment unbalanced data for automatic scoring. Preprint, arXiv:2310.18365.

Yao Fu, Hao Peng, Litu Ou, Ashish Sabharwal, and Tushar Khot. 2023. Specializing smaller language models towards multi-step reasoning. arXiv preprint arXiv:2301.12726.

Amir Gholami, Sehoon Kim, Zhen Dong, Zhewei Yao, Michael W Mahoney, and Kurt Keutzer. 2022. A survey of quantization methods for efficient neural network inference. In Low-Power Computer Vision, pages 291–326. Chapman and Hall/CRC.

Google. 2024. Google gemini: Next generation model. Accessed: 2024-07-29.

Pengcheng He, Jianfeng Gao, and Weizhu Chen. 2021. Debertav3: Improving deberta using electra-style pre-training with gradient-disentangled embedding sharing. arXiv preprint arXiv:2111.09543.

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2021. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685.

Sehoon Kim, Suhong Moon, Ryan Tabrizi, Nicholas Lee, Michael W Mahoney, Kurt Keutzer, and Amir Gholami. 2023. An llm compiler for parallel function calling. arXiv preprint arXiv:2312.04511.

Varun Kumar, Ashutosh Choudhary, and Eunah Cho. 2020. Data augmentation using pre-trained transformer models. In Proceedings of the 2nd Workshop on Life-long Learning for Spoken Language Systems, pages 18–26.

Langchain. https://github.com/langchain-ai/langchain.

Juyong Lee, Taywon Min, Minyong An, Changyeon Kim, and Kimin Lee. 2024a. Benchmarking mobile device control agents across diverse configurations. arXiv preprint arXiv:2404.16660.

Nicholas Lee, Thanakul Wattanawong, Sehoon Kim, Karttikeya Mangalam, Sheng Shen, Gopala Anumanchipali, Michael W Mahoney, Kurt Keutzer, and Amir Gholami. 2024b. Llm2llm: Boosting llms with novel iterative data enhancement. arXiv preprint arXiv:2403.15042.

Yaobo Liang, Chenfei Wu, Ting Song, Wenshan Wu, Yan Xia, Yu Liu, Yang Ou, Shuai Lu, Lei Ji, Shaoguang Mao, et al. 2024. Taskmatrix.ai: Completing tasks by connecting foundation models with millions of apis. Intelligent Computing, 3:0063.

Bingbin Liu, Sebastien Bubeck, Ronen Eldan, Janardhan Kulkarni, Yuanzhi Li, Anh Nguyen, Rachel Ward, and Yi Zhang. 2023. Tinygsm: Achieving 80% on gsm8k with small language models. Preprint, arXiv:2312.09241.

Jerry Liu. 2022. LlamaIndex.

Suhong Moon, Siddharth Jha, Lutfi Eren Erdogan, Sehoon Kim, Woosang Lim, Kurt Keutzer, and Amir Gholami. 2024. Efficient and scalable estimation of tool representations in vector space. arXiv preprint arXiv:2409.02141.

OpenAI. 2024. Hello gpt-4o. Accessed: 2024-07-29.

Shishir G Patil, Tianjun Zhang, Xin Wang, and Joseph E Gonzalez. 2023. Gorilla: Large language model connected with massive apis. arXiv preprint arXiv:2305.15334.

Archiki Prasad, Elias Stengel-Eskin, and Mohit Bansal. 2023. Rephrase, augment, reason: Visual grounding of questions for vision-language models. Preprint, arXiv:2310.05861.

Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, et al. 2023. Toolllm: Facilitating large language models to master 16000+ real-world apis. arXiv preprint arXiv:2307.16789.

Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. 2022. Robust speech recognition via large-scale weak supervision. Preprint, arXiv:2212.04356.

Christopher Rawles, Sarah Clinckemaillie, Yifan Chang, Jonathan Waltz, Gabrielle Lau, Marybeth Fair, Alice Li, William Bishop, Wei Li, Folawiyo Campbell-Ajala, Daniel Toyama, Robert Berry, Divya Tyamagundlu, Timothy Lillicrap, and Oriana Riva. 2024a. Androidworld: A dynamic benchmarking environment for autonomous agents. Preprint, arXiv:2405.14573.

Christopher Rawles, Alice Li, Daniel Rodriguez, Oriana Riva, and Timothy Lillicrap. 2024b. Androidinthewild: A large-scale dataset for android device control. Advances in Neural Information Processing Systems, 36.

Jingqing Ruan, Yihong Chen, Bin Zhang, Zhiwei Xu, Tianpeng Bao, Guoqing Du, Shiwei Shi, Hangyu Mao, Xingyu Zeng, and Rui Zhao. 2023. Tptu: Task planning and tool usage of large language model-based ai agents. arXiv preprint arXiv:2308.03427.
Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. 2024. Toolformer: Language models can teach themselves to use tools. Advances in Neural Information Processing Systems, 36.

Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. 2024. Hugginggpt: Solving ai tasks with chatgpt and its friends in hugging face. Advances in Neural Information Processing Systems, 36.

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.

Solomon Ubani, Suleyman Olcay Polat, and Rodney Nielsen. 2023. Zeroshotdataaug: Generating and augmenting training data with chatgpt. arXiv preprint arXiv:2304.14334.

Jiwen Zhang, Jihao Wu, Yihua Teng, Minghui Liao, Nuo Xu, Xiao Xiao, Zhongyu Wei, and Duyu Tang. 2024b. Android in the zoo: Chain-of-action-thought for gui agents. Preprint, arXiv:2403.02713.

Peiyuan Zhang, Guangtao Zeng, Tianduo Wang, and Wei Lu. 2024c. Tinyllama: An open-source small language model. arXiv preprint arXiv:2401.02385.

Yue Zhang, Yafu Li, Leyang Cui, Deng Cai, Lemao Liu, Tingchen Fu, Xinting Huang, Enbo Zhao, Yu Zhang, Yulong Chen, et al. 2023. Siren's song in the ai ocean: A survey on hallucination in large language models. arXiv preprint arXiv:2309.01219.

Chunting Zhou, Pengfei Liu, Puxin Xu, Srini Iyer, Jiao Sun, Yuning Mao, Xuezhe Ma, Avia Efrat, Ping Yu, Lili Yu, Susan Zhang, Gargi Ghosh, Mike Lewis, Luke Zettlemoyer, and Omer Levy. 2023. Lima: Less is more for alignment. Preprint, arXiv:2305.11206.