
TinyAgent: Function Calling at the Edge

Lutfi Eren Erdogan∗1 Nicholas Lee∗1 Siddharth Jha∗1 Sehoon Kim1


Ryan Tabrizi1 Suhong Moon1 Coleman Hooper1
Gopala Anumanchipalli1 Kurt Keutzer1 Amir Gholami1,2
1 UC Berkeley   2 ICSI
{lerdogan, nicholas.lee, sidjha, sehoonkim, rtabrizi, suhong.moon, chooper, gopala, keutzer, amirgh}@berkeley.edu

Abstract

Recent large language models (LLMs) have enabled the development of advanced agentic systems that can integrate various tools and APIs to fulfill user queries through function calling. However, the deployment of these LLMs on the edge has not been explored since they typically require cloud-based infrastructure due to their substantial model size and computational demands. To this end, we present TinyAgent, an end-to-end framework for training and deploying task-specific small language model agents capable of function calling for driving agentic systems at the edge. We first show how to enable accurate function calling for open-source models via the LLMCompiler framework. We then systematically curate a high-quality dataset for function calling, which we use to fine-tune two small language models, TinyAgent-1.1B and 7B. For efficient inference, we introduce a novel tool retrieval method to reduce the input prompt length and utilize quantization to further accelerate the inference speed. As a driving application, we demonstrate a local Siri-like system for Apple's MacBook that can execute user commands through text or voice input. Our results show that our models can achieve, and even surpass, the function-calling capabilities of larger models like GPT-4-Turbo, while being fully deployed at the edge. We open-source our dataset, models, and installable package¹ and provide a demo video for our MacBook assistant agent².

1 Introduction

The ability of LLMs to execute commands through plain language (e.g. English) has enabled agentic systems that can complete a user query by orchestrating the right set of tools (e.g. ToolFormer (Schick et al., 2024), Gorilla (Patil et al., 2023)). This, along with recent multi-modal efforts such as GPT-4o (OpenAI, 2024) or Gemini-1.5 (Google, 2024), has expanded the realm of possibilities with AI agents. However, the large model size and computational requirements of these models often require their inference to be performed on the cloud. This can create several challenges for their widespread adoption. First, uploading data such as video, audio, or text documents to a third-party vendor on the cloud can result in privacy issues. Second, it requires cloud/Wi-Fi connectivity, which is not always possible; for instance, a robot deployed in the real world may not always have a stable connection. Third, latency can also be an issue, as uploading large amounts of data to the cloud and waiting for the response slows down response time, resulting in unacceptable time-to-solution. These challenges could be solved if we deploy the LLM locally at the edge.

Current LLMs like GPT-4o (OpenAI, 2024) or Gemini-1.5 (Google, 2024) are too large for local deployment. One contributing factor is that much of the model size ends up memorizing general information about the world in its parametric memory, which may not be necessary for a specialized downstream application. For instance, if you ask these models a general factual question about a historical event or a well-known figure, they can produce the answer from their parametric memory, even without additional context in their prompt. This implicit memorization of training data into parametric memory might be correlated with "emergent" phenomena in LLMs, such as in-context learning and complex reasoning, and has been the driving force behind scaling model size.

This leads to an intriguing research question: can a smaller language model with significantly less parametric memory emulate such emergent abilities of these larger language models?

* Equal contribution
¹ https://github.com/SqueezeAILab/TinyAgent
² https://www.youtube.com/watch?v=0GvaGL9IDpQ
In this work, we demonstrate that this is feasible by training smaller models with specialized, high-quality data that does not require recalling generic world knowledge. Our goal is to develop Small Language Models (SLMs) that can be securely and privately deployed at the edge while maintaining the complex reasoning capability to understand natural language queries and orchestrate tools and APIs to accomplish user commands.

To achieve this, we first explore enabling small open-source models to perform accurate function calling, a key component of agentic systems. Off-the-shelf SLMs often lack sophisticated function calling capabilities and require fine-tuning. Next, we discuss systematically curating high-quality function calling datasets to train these SLMs, using a specialized Mac assistant agent as our primary application. We demonstrate that fine-tuning the models on this curated dataset can enable SLMs to exceed GPT-4-Turbo's function calling performance. Finally, we enhance the inference efficiency of these fine-tuned models using a novel Tool RAG method and quantization, allowing for efficient edge deployment with real-time responses.

2 Related Work

2.1 Function Calling LLMs

The sophisticated reasoning capabilities of recent LLMs have enabled them to call functions (i.e., tools), where LLMs determine which function to invoke among user-provided functions along with the associated arguments. This allows LLMs to use external functions (e.g. calculators or search engines) to provide more accurate answers to user queries than by responding directly. A pioneering work in this area is Toolformer (Schick et al., 2024), which has inspired various tool-calling frameworks (Ruan et al., 2023; Shen et al., 2024; Liang et al., 2024). ReAct (Yao et al., 2022) introduced a reasoning-and-action process that improved LLMs' interaction with external environments, which has become a backbone for different open-source frameworks (Liu, 2022; Langchain). More recently, Gorilla (Patil et al., 2023) and ToolLLM (Qin et al., 2023) have demonstrated that an open-source LLM can be fine-tuned to obtain function-calling capabilities in diverse real-world use cases. One notable work is Octopus (Chen et al., 2024), which introduces on-device LLMs that invoke software APIs. TinyAgent pushes this boundary by enabling efficient inference via parallel function calling (Kim et al., 2023) as well as a novel tool retrieval method, similar to (Moon et al., 2024). Furthermore, our method does not require any architectural changes, making it compatible with a wider range of open-source models.

2.2 Dataset Synthesis

To address the problem of not having enough data for finetuning, a popular method has emerged to use LLMs to synthesize new training datapoints (Deng et al., 2023; Prasad et al., 2023; Fu et al., 2023; Dai et al., 2023; Ubani et al., 2023; Fang et al., 2023; Liu et al., 2023; Yu et al., 2023; Kumar et al., 2020; Yoo et al., 2021; Wang et al., 2022; Lee et al., 2024b). While these techniques produce very good results, they often generate a significant amount of training data. Recent advancements have shown that by filtering these datasets or generating smaller, higher-quality datasets, one can achieve similar or better performance (Chen et al., 2023; Cao et al., 2023; Wei et al., 2023; Zhou et al., 2023). TinyAgent builds on these works by constructing a pipeline that systematically generates high-quality, task-specific function-calling datasets, ensuring efficient training and robust performance even with smaller, curated datasets.

2.3 Device Control

Recent advancements in device control have introduced large-scale benchmarks and datasets focused on the Android environment (Rawles et al., 2024b; Zhang et al., 2024b; Rawles et al., 2024a; Lee et al., 2024a), which explore UI-based agents with low-level controls such as typing, scrolling, and tapping. They are primarily concerned with mobile device interactions in simulated environments, and they do not address the challenges of deploying small language models directly on the device, which is crucial for real-world applications where cloud resources are unavailable or impractical. More recently, UFO (Zhang et al., 2024a) introduced a dual-agent framework that leverages vision and language to enable UI-focused agents to operate within Windows OS applications. However, similar to earlier works, UFO also focuses on low-level control mechanisms and does not address the deployment of small language models directly on the device.
TinyAgent pushes this boundary by formulating device control as a high-level function-calling problem instead of low-level UI actions, utilizing task-specific abstractions that allow for more robust and efficient execution of commands. By running fully locally on MacOS, TinyAgent offers a more realistic and practical solution for device control, making it well-suited for real-life scenarios where on-device deployment is necessary.

3 TinyAgent

3.1 Teaching LLMs to do Function Calling

As mentioned above, our main interest is applications where the AI agent translates the user query into a sequence of function calls to complete the task. In such applications, the model does not need to write the function definitions itself, since the functions (or APIs) are mostly pre-defined and already available. Therefore, what the model needs to do is to determine (i) which functions to call, (ii) the corresponding input arguments, and (iii) the right order of calling these functions (i.e. function orchestration) based on the required interdependency across the function calls.

The first question is to find an effective way to equip SLMs to perform function calling. Large models such as GPT-4 are able to perform function calling, but how can this be achieved with open-source models? LLMCompiler (Kim et al., 2023) is a recent framework that enables this by instructing the LLM to output a function calling plan that includes the set of functions it needs to call along with the input arguments and their dependencies (see the example in Figure 1). Once this function calling plan is generated, we can parse it and call each function based on the dependencies. The critical part here is how to teach the model to create this function calling plan with the right syntax and dependencies. The original LLMCompiler (Kim et al., 2023) only considered large models, such as LLaMA-2 70B (Touvron et al., 2023), which have complex reasoning capabilities to create the plan when provided with sufficient instructions in their prompts. Unfortunately, our initial experiments showed that off-the-shelf small models such as TinyLlama-1.1B (Zhang et al., 2024c) (or even the larger Wizard-2-7B model (Vince, 2024)) are not able to output the correct plans when prompted the same way. The errors ranged from using the wrong set of functions and hallucinated names to wrong dependencies and inconsistent syntax.

This is rather expected, because these small models have been trained on generic datasets and are primarily targeted to achieve good accuracy on general benchmarks, which mostly test the model's world knowledge, general reasoning, or basic instruction-following capability. To address this, we explored whether fine-tuning these models on a high-quality dataset specially curated for function calling and planning can improve their accuracy for a targeted task, potentially outperforming larger models. In Section 3.2, we first discuss how we generated such a dataset, and then we discuss the fine-tuning approach in Section 3.3.

3.2 Dataset Generation

As a driving application, we consider a local agentic system for Apple's MacBook that solves the user's day-to-day tasks. In particular, the agent is equipped with 16 different functions that can interact with different applications on the Mac, including:

• Email: Compose a new email or reply to/forward emails
• Contacts: Retrieve phone numbers or email addresses from the contacts database
• SMS: Send text messages to contact(s)
• Calendar: Create calendar events with details such as title, time, attendees, etc.
• Notes: Create, open, or append content to notes in various folders
• Reminders: Set reminders for various activities and tasks
• File management: Open, read, or summarize documents in various file paths
• Zoom meetings: Schedule and organize Zoom meetings

Predefined Apple scripts exist for each of these functions/tools, and all that the model needs to do is take advantage of these predefined APIs and determine the right function calling plan to accomplish a given task, such as in Figure 1. However, as discussed previously, we need a dataset for training and evaluating SLMs since their off-the-shelf function calling capability is subpar. Creating handcrafted data with diverse function calling plans is both challenging and not scalable. However, we can curate synthetic data using a powerful LLM like GPT-4-Turbo.
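To make this synthesis step concrete, the sketch below shows one way such (query, plan) pairs could be generated; the prompt wording, the abbreviated tool registry, and the use of the OpenAI Python client are illustrative assumptions rather than the exact pipeline used for TinyAgent.

import json
import random
from openai import OpenAI  # assumes the openai>=1.0 Python client

client = OpenAI()

# Hypothetical registry of a few of the 16 Mac tools; signatures abbreviated for illustration.
TOOLS = {
    "get_email_address": "get_email_address(name: str) -> str",
    "create_calendar_event": "create_calendar_event(attendees: list[str], time: str, title: str)",
    "compose_new_email": "compose_new_email(to: list[str], subject: str, body: str)",
    "send_sms": "send_sms(to: list[str], body: str)",
}

def synthesize_example(num_tools: int = 3) -> dict:
    """Ask a strong LLM to invent a realistic query plus a plan that uses the sampled tools."""
    sampled = random.sample(list(TOOLS.values()), k=num_tools)
    prompt = (
        "You are generating training data for a function-calling agent.\n"
        "Available functions:\n" + "\n".join(sampled) + "\n"
        "Write a realistic user request that needs these functions, then the plan as lines "
        "of the form `$i = function(args)`, referencing earlier results as $1, $2, ...\n"
        "Return JSON with keys 'query' and 'plan'."
    )
    resp = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(resp.choices[0].message.content)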
[Figure 1 illustration: the user input "Create a calendar invite with Lutfi and Sid at 2pm tomorrow to discuss TinyAgent" is mapped by the Function Calling Planner to the plan $1 = get_email_address("Lutfi"); $2 = get_email_address("Sid"); $3 = create_calendar_event([$1, $2], "4/24 2PM", "TinyAgent Discussion"); $4 = join(), which forms a DAG of function calling tasks.]

Figure 1: Overview of the LLMCompiler Function Calling Planner. The Planner understands the user query and
generates a sequence of tasks with their inter-dependencies. These tasks are then dispatched by the LLMCompiler
framework to accomplish the user command. In this example, Task $1 and $2 are fetched together to retrieve the
email addresses of Sid and Lutfi independently. After each task is performed, the results are forwarded to Task $3
which creates the calendar event. Before executing Task $3, LLMCompiler replaces the placeholder variables (e.g.,
the variable $1 and $2 in Task $3) with actual values.
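For illustration, a plan in the format shown in Figure 1 can be parsed into tasks and dependency edges with a few lines of Python; the regular expression and Task structure below are a minimal sketch, not the LLMCompiler implementation.

import re
from dataclasses import dataclass

PLAN = '''$1 = get_email_address("Lutfi")
$2 = get_email_address("Sid")
$3 = create_calendar_event([$1, $2], "4/24 2PM", "TinyAgent Discussion")
$4 = join()'''

@dataclass
class Task:
    idx: int          # the $i identifier
    func: str         # function name to call
    raw_args: str     # unparsed argument string
    deps: list[int]   # indices of tasks whose results this task consumes

def parse_plan(plan: str) -> list[Task]:
    tasks = []
    for line in plan.strip().splitlines():
        m = re.match(r"\$(\d+)\s*=\s*(\w+)\((.*)\)\s*$", line.strip())
        idx, func, raw_args = int(m.group(1)), m.group(2), m.group(3)
        # Any $k mentioned in the arguments is a dependency edge k -> idx.
        deps = [int(d) for d in re.findall(r"\$(\d+)", raw_args)]
        tasks.append(Task(idx, func, raw_args, deps))
    return tasks

for t in parse_plan(PLAN):
    print(t.idx, t.func, "depends on", t.deps)
# Tasks with no dependencies ($1, $2) can be dispatched in parallel;
# $3 waits for both results, and join() marks the end of the plan.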

[Figure 2 illustration: the ground truth DAG is $1 = get_email_address("Lutfi"); $2 = get_email_address("Sid"); $3 = create_calendar_event([$1, $2], "4/24 2PM"). A correctly generated DAG (score 1) issues the same calls with the two get_email_address lookups swapped. An incorrectly generated DAG (score 0) calls $1 = get_phone_number("Lutfi") instead of get_email_address before $3 = create_calendar_event([$1, $2], "4/24 2PM").]

Figure 2: Graph Isomorphism Success Rate. The model scores a success rate of 1 only if the DAG of its generated plan is isomorphic to the DAG of the ground truth plan, and 0 otherwise. In the example above, the top case receives a success rate of 1: although the order of the get_email_address calls differs from the ground truth plan (the ground truth plan gets the email address of Lutfi before Sid, while the generated plan gets the email address of Sid before Lutfi), the two DAGs are isomorphic to each other. The bottom case receives a success rate of 0 because the predicted DAG contains a wrong node, corresponding to a wrong function call.
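A minimal sketch of this metric, assuming networkx for the isomorphism test (the library choice and node-labeling scheme are ours, not necessarily the authors'): each function call becomes a labeled node, each dependency an edge, and the score is 1 only when the two labeled DAGs are isomorphic.

import networkx as nx
from networkx.algorithms.isomorphism import categorical_node_match

def plan_to_dag(calls, edges):
    """calls[i] = (function_name, argument_tuple) for node i; edges are dependency pairs."""
    g = nx.DiGraph()
    for i, (fn, args) in enumerate(calls):
        g.add_node(i, label=(fn, args))
    g.add_edges_from(edges)
    return g

def success(pred: nx.DiGraph, gold: nx.DiGraph) -> int:
    # Score 1 iff the predicted DAG is isomorphic to the ground-truth DAG,
    # matching nodes by (function name, arguments); node ordering does not matter.
    same = nx.is_isomorphic(pred, gold, node_match=categorical_node_match("label", None))
    return int(same)

gold = plan_to_dag(
    [("get_email_address", ("Lutfi",)), ("get_email_address", ("Sid",)),
     ("create_calendar_event", ("$deps", "4/24 2PM"))],   # placeholder args abstracted
    edges=[(0, 2), (1, 2)],
)
pred = plan_to_dag(  # same calls with the two lookups in the opposite order -> still score 1
    [("get_email_address", ("Sid",)), ("get_email_address", ("Lutfi",)),
     ("create_calendar_event", ("$deps", "4/24 2PM"))],
    edges=[(0, 2), (1, 2)],
)
print(success(pred, gold))  # 1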

Such an approach is becoming a common method, where a capable LLM is instructed to generate data similar to a given set of sample examples or templates. In our work, we used a similar approach, but instead of providing the LLM with generic user queries as templates, we provide it with various sets of functions and instruct it to generate realistic user queries that require those functions to accomplish the task, along with the associated function calling plan and input arguments, like the example shown in Figure 1. To verify the validity of the generated data, we incorporated sanity checks on the function calling plan to make sure that it forms a feasible graph and that the function names and input argument types are correct. With this approach, we created 80K training examples, 1K validation examples, and 1K test examples, with a total cost of only ∼$500.

3.3 Fine-tuning for Improved Function Calling Reasoning

With our dataset in place, we can now proceed to fine-tune off-the-shelf SLMs to enhance their function calling capability. We started with two base small models: TinyLlama-1.1B (instruct-32k) and Wizard-2-7B. For fine-tuning these models, we first need to define a metric to evaluate their performance. Our objective is for these models to accurately generate the right plan, i.e., to select the right set of functions and to orchestrate them in the right order. Therefore, we define a success rate metric that assigns 1 if both criteria are met, and 0 otherwise. Checking whether the model has selected the right set of function calls is straightforward.
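The sanity checks mentioned in Section 3.2 (feasible dependency graph, valid function names, correct argument arity) could be approximated as follows; the schema dictionary and helper below are hypothetical, for illustration only.

import re

# Hypothetical expected argument types for a few of the 16 tools.
TOOL_SCHEMAS = {
    "get_email_address": [str],
    "get_phone_number": [str],
    "create_calendar_event": [list, str, str],
}

def validate_plan(tasks: list[dict]) -> bool:
    """tasks: [{'idx': 1, 'func': 'get_email_address', 'args': ['Lutfi']}, ...]"""
    seen = set()
    for t in tasks:
        # 1. Function must exist in the tool registry.
        if t["func"] not in TOOL_SCHEMAS:
            return False
        # 2. Placeholders may only reference earlier tasks, so the graph stays feasible (acyclic).
        for arg in t["args"]:
            for ref in re.findall(r"\$(\d+)", str(arg)):
                if int(ref) not in seen:
                    return False
        # 3. Argument count must match the schema (a fuller check would also test types).
        if len(t["args"]) != len(TOOL_SCHEMAS[t["func"]]):
            return False
        seen.add(t["idx"])
    return True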

Figure 3: Overview of our Tool RAG scheme. We formulate tool retrieval as a multi-label classification problem.
The user query is given as input to the fine-tuned DeBERTa-v3-small model, which outputs a 16-dimensional vector
indicating tool probabilities. Tools with probabilities higher than 50% are selected, averaging 3.97 tools per query
compared to 6 tools in basic RAG.
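A minimal sketch of such a retriever head, assuming Hugging Face transformers and PyTorch; the checkpoint name, the untrained head, and the omission of the multi-label (BCE) training loop make this illustrative rather than the released TinyAgent implementation.

import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

NUM_TOOLS = 16
THRESHOLD = 0.5

class ToolRetriever(nn.Module):
    def __init__(self, backbone: str = "microsoft/deberta-v3-small"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(backbone)
        # 768-dim CLS representation -> 16 tool logits (multi-label classification).
        self.head = nn.Linear(self.encoder.config.hidden_size, NUM_TOOLS)

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        cls = hidden[:, 0]                      # [CLS] token representation
        return torch.sigmoid(self.head(cls))   # independent per-tool probabilities

tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-small")
model = ToolRetriever().eval()  # the head would be fine-tuned on the 80K training queries

query = "Create a calendar invite with Lutfi and Sid at 2pm tomorrow"
batch = tokenizer(query, return_tensors="pt")
with torch.no_grad():
    probs = model(batch["input_ids"], batch["attention_mask"])[0]
selected = (probs > THRESHOLD).nonzero(as_tuple=True)[0].tolist()
# `selected` indexes into the 16-tool list; only those tool descriptions go into the prompt.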

To additionally ensure that the orchestration of these functions is correct, we construct a Directed Acyclic Graph (DAG) of the function calls based on the dependencies, as shown in Figure 2, where each node represents a function call and a directed edge from node A to node B represents their interdependency (i.e. function B can only be executed after the execution of function A). We then compare whether this DAG is identical to that of the ground truth plan to verify the accuracy of the dependencies.

After defining our evaluation metric, we applied LoRA (Hu et al., 2021) to fine-tune the models for 3 epochs using a learning rate of 7e-5 over the 80K training examples, and selected the best checkpoint based on validation performance. For fine-tuning, our prompt included not only the descriptions of the ground truth functions (i.e. functions used in the ground truth plan) but also other irrelevant functions as negative samples. We found the negative samples to be particularly effective for teaching the model how to select appropriate tools for a given query, hence improving the post-training performance. Furthermore, we also include several in-context examples demonstrating how queries are translated into function calling plans. These in-context examples are selected through a Retrieval-Augmented Generation (RAG) process based on the user query from the data in the training dataset.
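A rough sketch of this fine-tuning recipe using the Hugging Face peft and transformers libraries; the base checkpoint, LoRA rank, and target modules below are assumptions, while the 3 epochs and 7e-5 learning rate follow the description above.

from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments
from peft import LoraConfig, get_peft_model

# Base checkpoint and LoRA hyperparameters are assumptions; the paper only specifies
# LoRA, 3 epochs, and a 7e-5 learning rate over the 80K training examples.
base = AutoModelForCausalLM.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")

lora_cfg = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)

args = TrainingArguments(
    output_dir="tinyagent-lora",
    num_train_epochs=3,
    learning_rate=7e-5,
    per_device_train_batch_size=8,
)
# train/eval datasets would hold tokenized (prompt, plan) pairs, where each prompt contains
# ground-truth tools, negative-sample tools, and retrieved in-context examples; the best
# checkpoint is then picked on the 1K validation split.
trainer = Trainer(model=model, args=args, train_dataset=None, eval_dataset=None)
# trainer.train()  # omitted here: dataset construction is not shown in this sketch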
Using the above settings, we fine-tuned the TinyLlama-1.1B and Wizard-2-7B models. After fine-tuning, the 1.1B model's success rate improved from 12.71% to 78.89%, and the 7B model's performance improved from 41.25% to 83.09%, which is ∼4% higher than GPT-4-Turbo.

Figure 4: Efficient tool selection based on a user input. Not all user inputs require all available tools; hence, it is imperative to select the right set of tools to minimize the prompt size and increase performance. In this case, the LLM only needs the functions that get email addresses and create a calendar event to accomplish its task.

3.4 Efficient Inference with Tool RAG

Our primary goal is to be able to deploy the TinyAgent model locally on a MacBook, which has limited computational and memory resources compared to the GPUs that closed-source models like GPT are deployed on. To achieve efficient performance with low latency, we need to ensure not only that the model size is small, but also that the input prompt is as concise as possible. The latter is an important contributor to latency and computational resource consumption due to the quadratic complexity of attention in sequence length.

The fine-tuned TinyAgent model discussed previously was trained with the descriptions of all available tools in its prompt. However, we can significantly reduce the prompt size by only including the descriptions of relevant tools based on the user query. For instance, consider the example shown in Figure 4, where the user is asking to create a calendar invite with two people. In this case, the LLM only needs the functions that get email addresses and create a calendar event in its prompt.

To take advantage of this observation, we need to determine which functions are required to accomplish the user's command, which we refer to as Tool RAG given its similarity to how RAG works. However, the model performs poorly when we use a basic RAG method where we retrieve the relevant tools based on the embedding similarity between the user query and the tools. This is because completing a user's query often requires several auxiliary tools that may be missed by a simple RAG method if the embedding of the auxiliary tool is not similar to the user query.
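For reference, the basic embedding-similarity baseline (the "Basic RAG" row in Table 1) might look roughly like the sketch below; the embedding model and tool descriptions are placeholders, not the ones used in the paper.

from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder embedding model

TOOL_DESCRIPTIONS = {
    "get_email_address": "Retrieve a contact's email address by name.",
    "create_calendar_event": "Create a calendar event with attendees, time, and title.",
    "send_sms": "Send a text message to one or more contacts.",
    "summarize_pdf": "Summarize a PDF document at a given file path.",
}

def retrieve_tools(query: str, top_k: int = 3) -> list[str]:
    names = list(TOOL_DESCRIPTIONS)
    tool_emb = encoder.encode([TOOL_DESCRIPTIONS[n] for n in names], convert_to_tensor=True)
    query_emb = encoder.encode(query, convert_to_tensor=True)
    scores = util.cos_sim(query_emb, tool_emb)[0]
    ranked = sorted(zip(names, scores.tolist()), key=lambda x: -x[1])
    return [name for name, _ in ranked[:top_k]]

print(retrieve_tools("Create a calendar invite with Lutfi and Sid at 2pm tomorrow"))
# Likely surfaces create_calendar_event but can miss the auxiliary get_email_address call,
# which is why the fine-tuned DeBERTa classifier is used instead.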
Table 1: Comparison of TinyAgent performance with DeBERTa to Basic RAG and no RAG settings. For Basic RAG, we retrieved the top-3 most relevant tools. For our fine-tuned DeBERTa-v3-small model, we retrieved tools with a probability greater than 50%, which retrieves ∼3.97 tools per query.

Tool RAG Method                       Tool Recall   Prompt Size (Tokens)   TinyAgent 1.1B Success Rate (%)   TinyAgent 7B Success Rate (%)
No RAG (all tools in the prompt)      1             2762                   78.89                             83.09
Basic RAG                             0.949         1674                   74.88                             78.50
Fine-tuned DeBERTa-v3-small (Ours)    0.998         1397                   80.06                             84.95

Table 2: Latency, size, and success rate of TinyAgent models before and after quantization. Latency is the end-to-end latency of the function calling planner, including the prompt processing time and generation.

Model            Weight Precision   Latency (seconds)   Model Size (GB)   Success Rate (%)
GPT-3.5          Unknown            3.2                 Unknown           65.04
GPT-4-Turbo      Unknown            3.9                 Unknown           79.08
TinyAgent-1.1B   16                 3.9                 2.2               80.06
TinyAgent-1.1B   4                  2.9                 0.68              80.35
TinyAgent-7B     16                 19.5                14.5              84.95
TinyAgent-7B     4                  13.1                4.37              85.14

For instance, the example shown in Figure 4 requires calling the get_email_address function even though the user query is just asking about creating a calendar invitation.

This can be addressed by treating the problem as a classification of which tools are needed. To that end, we fine-tuned a DeBERTa-v3-small (He et al., 2021) model on the training data to perform 16-way classification, as shown in Figure 3. The user query is given as input to this model, and we then pass the CLS token at the end through a simple fully connected layer of size 768x16 to transform it into a 16-dimensional vector (the total number of our tools). The output of this layer is passed through a sigmoid layer to produce the probability of selecting each tool. During inference, we select the tools that have a probability higher than 50% and include their descriptions in the prompt. On average we noticed that only 3.97 tools are retrieved with a recall of 0.998, whereas basic RAG requires using the top 6 tools to achieve a tool recall of 0.968.

We evaluated the model performance after incorporating Tool RAG. The results are shown in Table 1, where we report the performance of the simple RAG system along with the fine-tuned DeBERTa approach. As one can see, the DeBERTa-based Tool RAG method achieves almost perfect recall, improves the baseline accuracy, and reduces the prompt size by ∼2x in tokens.

3.5 Fast Edge Deployment with Quantization

Deploying models at the edge, such as on consumer MacBooks, can still be challenging even for small models with O(1B) parameters, since loading the model parameters can consume a large portion of the available memory. A solution to these issues is quantization, which allows us to store the model at a reduced bit precision. Quantization not only reduces the storage requirements and model footprint, but also cuts down the time and resources needed to load model weights into memory, thereby reducing the overall inference latency as well. For more information on quantization, refer to (Gholami et al., 2022).

To deploy the models more efficiently, we quantized them to 4 bits with a group size of 32, which is supported by the llama.cpp framework with quantization-aware training. As shown in Table 2, the 4-bit models result in 30% better latency, along with a 4x reduction in model size. We also notice a slight accuracy improvement, which is due to the additional fine-tuning with simulated quantization.

4 Putting It All Together

We provide a demo video of the final TinyAgent-1.1B model deployed on a MacBook Pro M3³, which can be downloaded and tested on a Mac from the link⁴.

³ https://www.youtube.com/watch?v=0GvaGL9IDpQ
⁴ https://github.com/SqueezeAILab/TinyAgent/raw/main/TinyAgent.zip
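As a rough illustration of running such a 4-bit model locally, the snippet below uses the llama-cpp-python bindings; the GGUF file name, prompt format, and generation settings are assumptions, not the packaged TinyAgent app.

from llama_cpp import Llama

# Hypothetical local path; 4-bit weights with group size 32 correspond to a GGUF quantized file.
llm = Llama(model_path="tinyagent-1.1b-q4_0.gguf", n_ctx=4096, verbose=False)

prompt = (
    "Available tools: get_email_address, create_calendar_event\n"
    "User: Create a calendar invite with Lutfi and Sid at 2pm tomorrow.\n"
    "Plan:\n"
)
out = llm(prompt, max_tokens=128, temperature=0.0, stop=["User:"])
print(out["choices"][0]["text"])  # expected to contain the $1/$2/$3 function-calling plan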
It not only runs all of the model inference locally on your computer, but it also allows you to provide commands through audio. We process the audio locally as well, using the Whisper-v3 (Radford et al., 2022) model from OpenAI deployed locally with the whisper.cpp framework. The greatest surprise for us was that the accuracy of the 1.1B model exceeds that of GPT-4-Turbo, and it is markedly fast while deployed locally and privately on-device.

5 Conclusions

To summarize, we introduced TinyAgent and showed that it is indeed possible to train a small language model and use it to power a semantic system that processes user queries. In particular, we considered a Siri-like assistant for Mac as a driving application. The key components for enabling it are to (i) teach off-the-shelf SLMs to perform function calling through the LLMCompiler framework, (ii) curate high-quality function calling data for the task at hand, (iii) fine-tune the off-the-shelf model on the generated data, and (iv) enable efficient deployment by optimizing the prompt size through retrieving only the necessary tools based on the user query via Tool RAG, as well as quantized model deployment to reduce inference resource consumption. After these steps, our final models achieved success rates of 80.06% and 84.95% for the TinyAgent-1.1B and 7B models, respectively, which exceed GPT-4-Turbo's success rate of 79.08% on this task.

6 Ethics Statement

Deploying TinyAgent to operate agentic systems at the edge presents several ethical considerations that are integral to our design and operational philosophy.

Accessibility and Inclusivity: Ensuring that TinyAgent serves all users equitably, including those with disabilities, is a priority. We are committed to designing interfaces that are universally accessible, incorporating features such as voice recognition that can understand diverse speech patterns and text-to-speech technologies that are clear and easily comprehensible. Further, we are exploring adaptive technologies that can adjust to the specific needs of users with varying abilities, ensuring that everyone can benefit from TinyAgent's capabilities without barriers.

Human Oversight: While TinyAgent demonstrates robust capabilities in function calling, the risk of hallucination and erroneous responses by LLMs remains (Zhang et al., 2023). To mitigate this, it is essential to maintain human oversight throughout the operational loop, not just at the endpoint. This means integrating mechanisms for regular checks and balances where humans can review, override, or refine decisions made by TinyAgent. Future iterations of our system will aim to facilitate even more seamless human-agent collaboration to enhance decision accuracy and reliability.

Cultural and Bias Considerations: Synthetic datasets generated using simple or naive prompts often carry inherent biases, such as those related to regional or cultural specificity (Yu et al., 2024). Because task-specific agent systems like TinyAgent rely on synthetic data, their effectiveness and impartiality can be impacted when operating across different demographic landscapes. In response, we integrate diverse cultural data and demographic groups in our data generation processes to mitigate these biases. Our aim is to ensure that the synthetic data fueling TinyAgent is as inclusive and unbiased as possible, supporting a function-calling system that is culturally aware and equitably serves a global user base.

Acknowledgements

We would like to thank Apple for sponsoring this project, as well as Microsoft for support through the Accelerating Foundation Models Research Program. We also thank Sunjin Choi for his insights on the energy costs associated with local and cloud deployment. Our conclusions do not necessarily reflect the position or the policy of our sponsors, and no official endorsement should be inferred.

References

Yihan Cao, Yanbin Kang, Chi Wang, and Lichao Sun. 2023. Instruction mining: When data mining meets large language model finetuning. Preprint, arXiv:2307.06290.

Lichang Chen, Shiyang Li, Jun Yan, Hai Wang, Kalpa Gunaratna, Vikas Yadav, Zheng Tang, Vijay Srinivasan, Tianyi Zhou, Heng Huang, and Hongxia Jin. 2023. Alpagasus: Training a better alpaca with fewer data. Preprint, arXiv:2307.08701.
Wei Chen, Zhiyuan Li, and Mingyuan Ma. 2024. Octopus: On-device language model for function calling of software apis. Preprint, arXiv:2404.01549.

Haixing Dai, Zhengliang Liu, Wenxiong Liao, Xiaoke Huang, Yihan Cao, Zihao Wu, Lin Zhao, Shaochen Xu, Wei Liu, Ninghao Liu, Sheng Li, Dajiang Zhu, Hongmin Cai, Lichao Sun, Quanzheng Li, Dinggang Shen, Tianming Liu, and Xiang Li. 2023. Auggpt: Leveraging chatgpt for text data augmentation. Preprint, arXiv:2302.13007.

Yihe Deng, Weitong Zhang, Zixiang Chen, and Quanquan Gu. 2023. Rephrase and respond: Let large language models ask better questions for themselves. Preprint, arXiv:2311.04205.

Luyang Fang, Gyeong-Geon Lee, and Xiaoming Zhai. 2023. Using gpt-4 to augment unbalanced data for automatic scoring. Preprint, arXiv:2310.18365.

Yao Fu, Hao Peng, Litu Ou, Ashish Sabharwal, and Tushar Khot. 2023. Specializing smaller language models towards multi-step reasoning. arXiv preprint arXiv:2301.12726.

Amir Gholami, Sehoon Kim, Zhen Dong, Zhewei Yao, Michael W Mahoney, and Kurt Keutzer. 2022. A survey of quantization methods for efficient neural network inference. In Low-Power Computer Vision, pages 291–326. Chapman and Hall/CRC.

Google. 2024. Google gemini: Next generation model. Accessed: 2024-07-29.

Pengcheng He, Jianfeng Gao, and Weizhu Chen. 2021. Debertav3: Improving deberta using electra-style pre-training with gradient-disentangled embedding sharing. arXiv preprint arXiv:2111.09543.

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2021. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685.

Sehoon Kim, Suhong Moon, Ryan Tabrizi, Nicholas Lee, Michael W Mahoney, Kurt Keutzer, and Amir Gholami. 2023. An llm compiler for parallel function calling. arXiv preprint arXiv:2312.04511.

Varun Kumar, Ashutosh Choudhary, and Eunah Cho. 2020. Data augmentation using pre-trained transformer models. In Proceedings of the 2nd Workshop on Life-long Learning for Spoken Language Systems, pages 18–26.

Langchain. https://github.com/langchain-ai/langchain.

Juyong Lee, Taywon Min, Minyong An, Changyeon Kim, and Kimin Lee. 2024a. Benchmarking mobile device control agents across diverse configurations. arXiv preprint arXiv:2404.16660.

Nicholas Lee, Thanakul Wattanawong, Sehoon Kim, Karttikeya Mangalam, Sheng Shen, Gopala Anumanchipali, Michael W Mahoney, Kurt Keutzer, and Amir Gholami. 2024b. Llm2llm: Boosting llms with novel iterative data enhancement. arXiv preprint arXiv:2403.15042.

Yaobo Liang, Chenfei Wu, Ting Song, Wenshan Wu, Yan Xia, Yu Liu, Yang Ou, Shuai Lu, Lei Ji, Shaoguang Mao, et al. 2024. Taskmatrix.ai: Completing tasks by connecting foundation models with millions of apis. Intelligent Computing, 3:0063.

Bingbin Liu, Sebastien Bubeck, Ronen Eldan, Janardhan Kulkarni, Yuanzhi Li, Anh Nguyen, Rachel Ward, and Yi Zhang. 2023. Tinygsm: achieving 80% on gsm8k with small language models. Preprint, arXiv:2312.09241.

Jerry Liu. 2022. LlamaIndex.

Suhong Moon, Siddharth Jha, Lutfi Eren Erdogan, Sehoon Kim, Woosang Lim, Kurt Keutzer, and Amir Gholami. 2024. Efficient and scalable estimation of tool representations in vector space. arXiv preprint arXiv:2409.02141.

OpenAI. 2024. Hello gpt-4o. Accessed: 2024-07-29.

Shishir G Patil, Tianjun Zhang, Xin Wang, and Joseph E Gonzalez. 2023. Gorilla: Large language model connected with massive apis. arXiv preprint arXiv:2305.15334.

Archiki Prasad, Elias Stengel-Eskin, and Mohit Bansal. 2023. Rephrase, augment, reason: Visual grounding of questions for vision-language models. Preprint, arXiv:2310.05861.

Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, et al. 2023. Toolllm: Facilitating large language models to master 16000+ real-world apis. arXiv preprint arXiv:2307.16789.

Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. 2022. Robust speech recognition via large-scale weak supervision. Preprint, arXiv:2212.04356.

Christopher Rawles, Sarah Clinckemaillie, Yifan Chang, Jonathan Waltz, Gabrielle Lau, Marybeth Fair, Alice Li, William Bishop, Wei Li, Folawiyo Campbell-Ajala, Daniel Toyama, Robert Berry, Divya Tyamagundlu, Timothy Lillicrap, and Oriana Riva. 2024a. Androidworld: A dynamic benchmarking environment for autonomous agents. Preprint, arXiv:2405.14573.

Christopher Rawles, Alice Li, Daniel Rodriguez, Oriana Riva, and Timothy Lillicrap. 2024b. Androidinthewild: A large-scale dataset for android device control. Advances in Neural Information Processing Systems, 36.

Jingqing Ruan, Yihong Chen, Bin Zhang, Zhiwei Xu, Tianpeng Bao, Guoqing Du, Shiwei Shi, Hangyu Mao, Xingyu Zeng, and Rui Zhao. 2023. Tptu: Task planning and tool usage of large language model-based ai agents. arXiv preprint arXiv:2308.03427.
Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. 2024. Toolformer: Language models can teach themselves to use tools. Advances in Neural Information Processing Systems, 36.

Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. 2024. Hugginggpt: Solving ai tasks with chatgpt and its friends in hugging face. Advances in Neural Information Processing Systems, 36.

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.

Solomon Ubani, Suleyman Olcay Polat, and Rodney Nielsen. 2023. Zeroshotdataaug: Generating and augmenting training data with chatgpt. arXiv preprint arXiv:2304.14334.

Amazing Vince. 2024. Not-wizardlm-2-7b. Accessed: 2024-07-29.

Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A Smith, Daniel Khashabi, and Hannaneh Hajishirzi. 2022. Self-instruct: Aligning language models with self-generated instructions. arXiv preprint arXiv:2212.10560.

Lai Wei, Zihao Jiang, Weiran Huang, and Lichao Sun. 2023. Instructiongpt-4: A 200-instruction paradigm for fine-tuning minigpt-4. Preprint, arXiv:2308.12067.

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. 2022. React: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629.

Kang Min Yoo, Dongju Park, Jaewook Kang, Sang-Woo Lee, and Woomyeong Park. 2021. Gpt3mix: Leveraging large-scale language models for text augmentation. arXiv preprint arXiv:2104.08826.

Longhui Yu, Weisen Jiang, Han Shi, Jincheng Yu, Zhengying Liu, Yu Zhang, James T. Kwok, Zhenguo Li, Adrian Weller, and Weiyang Liu. 2023. Metamath: Bootstrap your own mathematical questions for large language models. Preprint, arXiv:2309.12284.

Yue Yu, Yuchen Zhuang, Jieyu Zhang, Yu Meng, Alexander J Ratner, Ranjay Krishna, Jiaming Shen, and Chao Zhang. 2024. Large language model as attributed training data generator: A tale of diversity and bias. Advances in Neural Information Processing Systems, 36.

Chaoyun Zhang, Liqun Li, Shilin He, Xu Zhang, Bo Qiao, Si Qin, Minghua Ma, Yu Kang, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang, and Qi Zhang. 2024a. Ufo: A ui-focused agent for windows os interaction. Preprint, arXiv:2402.07939.

Jiwen Zhang, Jihao Wu, Yihua Teng, Minghui Liao, Nuo Xu, Xiao Xiao, Zhongyu Wei, and Duyu Tang. 2024b. Android in the zoo: Chain-of-action-thought for gui agents. Preprint, arXiv:2403.02713.

Peiyuan Zhang, Guangtao Zeng, Tianduo Wang, and Wei Lu. 2024c. Tinyllama: An open-source small language model. arXiv preprint arXiv:2401.02385.

Yue Zhang, Yafu Li, Leyang Cui, Deng Cai, Lemao Liu, Tingchen Fu, Xinting Huang, Enbo Zhao, Yu Zhang, Yulong Chen, et al. 2023. Siren's song in the ai ocean: a survey on hallucination in large language models. arXiv preprint arXiv:2309.01219.

Chunting Zhou, Pengfei Liu, Puxin Xu, Srini Iyer, Jiao Sun, Yuning Mao, Xuezhe Ma, Avia Efrat, Ping Yu, Lili Yu, Susan Zhang, Gargi Ghosh, Mike Lewis, Luke Zettlemoyer, and Omer Levy. 2023. Lima: Less is more for alignment. Preprint, arXiv:2305.11206.
