Instructions
The response has been prefilled with credible ratings, but please check and make
sure the justifications are clear and robust. If you feel that the
ratings/justifications are incorrect, please correct them.
Note: disregard the time in the descriptive context when rating the tasks.
Also, do not use our current reference point in time (i.e., 2024) when rating the tasks. For
example, if google_search is called with query="World Series Schedule" and results
from 2023 show up, that is fine. The model is being run in a simulated
environment.
DO NOT RUN THE CODE. THE OUTPUT OF THE CODE IS ALREADY AVAILABLE.
Project Resources
Task Goals
Task Overview
Competitor Response
Task Workflow
Dimensions
Code Dimension
Code Quality
Response Dimensions
Major Dimensions
Instruction Following
Contextual Awareness
Harmlessness
Truthfulness
Minor Dimensions
Content Conciseness
Content Completeness
Collaborativity
Overall Quality
SxS rating
SxS Justification
Project Resources
Tool Schema Docs
Task Goals
Understand the user’s needs based on the conversation.
Evaluate the quality of the Code for each model.
Evaluate the quality of the Response for each model.
Identify the better Response and justify why.
Task Overview
Each task will include a conversation between the user and the model, along with
two sets of Code and Code Output and two Responses, which you will evaluate and compare.
The tools used by the model in this project are what we call Synthetic APIs. This
means that the outputs from the Synthetic APIs should be taken as a source of truth
and you should not attempt to fact check the outputs from the Synthetic APIs.
Instead you will check for issues in the simulation.
The user’s request needs to be determined from the entire conversation with the
most recent request as the main focus. Based on your interpretation of the user’s
request, you will evaluate each Code, Code Output and Response along specific
dimensions. It’s important to note that your evaluation and ratings of the Code
and Code Output section should not affect your evaluation of the Responses and vice
versa.
Competitor Response
In some tasks, you will see that the final response looks something like this:
You will notice a lot of differences in the Code of the competitor response when
compared to our normal response. For example, all of the code steps are grouped
together, followed by the code outputs; this is intended. You will also notice that
sometimes the Code in the competitor response will not have the full API tool
executions, and it is unclear what we should compare it to in the API schema. This is
also fine. This is because the model that generates the competitor response does
not call on the same tools. This means we do not check the competitor response
Code against the API schema. Instead, we want to look for:
Is the code throwing any errors?
Are there any parameter values being used that don't align with the prompt?
Does what the code is doing make sense with respect to the prompt?
Does the output of the code look like it would be useful in fulfilling the prompt?
Essentially, we’re checking if it looks reasonable. Just like how we rate the Code
of the normal response, we can rate the Code of the competitor response as having
issues depending on the severity of the issue.
Task Workflow
Read the conversation and the final prompt. Interpret the user’s expectations from
the model by putting yourself in the shoes of the user.
Prompt Analysis Guide
Analyze the Simulation, Code, and Responses.
Simulation, Code, and Response Analysis Guide
User: I’m planning a trip to New York this weekend, can you help me find direct flights that
Model: Sure! Here are some flights to New York departing before 8AM this weekend!
…list of flights…
The user is updating the flight destination with their last request. There are no
other instructions given to the model regarding the flight departure date or the
time. The model needs to carry over the “weekend” and “8AM” requirements from the
previous conversation history and find flights from San Francisco to Miami that
depart this weekend before 8AM.
User: Can you show me a list of 5 star hotels that are in New York?
…list of hotels…
User: Give me a list of 5 dinosaurs ranked by their size.
There is no previous conversation, and the model only needs the final request to
fulfill the prompt, which is to list 5 dinosaurs ranked by their size.
The goal is to analyze the conversation to identify user constraints and user
expectations. This step is extremely important since you will be assessing the
Code, Code Output and Responses based on your interpretation of the prompt. It’s
helpful to keep notes for tasks that have many constraints and/or requests. Try to
visualize the ideal execution steps the model should take and the ideal response
that would be fulfilling to the user.
The components of this project's task can be divided into two main categories: the
model and the simulation.
The Model is in charge of taking action, which entails writing the Tool Code and
generating the Response. On the other hand, the Code Output that comes after the
Tool Code is what we call The Simulation since it's a simulation of what might be
returned from the API. Again, this distinction is important: the model is in charge
of any action taken (i.e., writing the Tool Code and generating a Response),
whereas the Code Output of the tool calls is part of the simulation and not the
model's responsibility.
** When rating the competitor response, mark the first simulation question (alignment
with the API schema) as aligned. This is because we do not check the competitor code
against the API schema. **
Dimensions
Each dimension justification requires very specific reasoning on why that specific
dimension was rated as having issues.
This also means that you should follow the rubrics below and not add your own
interpretation that is not listed in the rubrics. DO NOT bring rules from
different projects and apply them here.
If you have any doubts while rating a dimension, just come back to the instructions
and carefully read what the instructions say about that specific dimension. If you
still have doubts after, ask for help in the project channels.
Remember, when you are rating each dimension, only think about the dimension you
are currently rating. We are rating the tasks following the customer’s
specifications.
1) Do the outputs of the simulated API calls align with what is expected from the
API spec?
This question asks whether the Code Output aligns with the response schema in the
API spec. For example, if we call spothero.searchParking with valid parameters, we
would expect the API to return a list of ParkingSpots based on the API Schema.
Furthermore, the ParkingSpots returned should have only the fields listed in the
schema: it would be Somewhat Aligned to have an extra field named phoneNumber and
Not Aligned if none of the fields adhered to the schema. To illustrate another
example, the output is Not Aligned with the API Schema if searchParking returns a
single ParkingSpot rather than an array of ParkingSpots.
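To make this concrete, below is a minimal, self-contained sketch of the alignment check. The field names (spotId, name, price) and values are assumptions for illustration only; the real expected fields always come from the Tool Schema Docs.

# Minimal sketch of the schema-alignment check (field names are assumptions).
expected_fields = {"spotId", "name", "price"}

aligned = [{"spotId": "ps_101", "name": "Main St Garage", "price": 18.0}]
somewhat_aligned = [{"spotId": "ps_101", "name": "Main St Garage",
                     "price": 18.0, "phoneNumber": "555-0100"}]  # extra field not in schema
not_aligned = {"parking": "available"}  # single object, none of the schema fields

def extra_fields(output):
    # Fields present in the output that the schema does not define.
    return {key for spot in output for key in spot} - expected_fields

print(extra_fields(aligned))           # set() -> Completely Aligned
print(extra_fields(somewhat_aligned))  # {'phoneNumber'} -> Somewhat Aligned
print(isinstance(not_aligned, list))   # False -> Not Aligned (not even an array)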
Completely Aligned
There are no discrepancies between what we see in the code output and what we
expect according to the API Schema.
Somewhat Aligned
There is a slight discrepancy with the API spec that doesn’t hinder the overall
usefulness of the output.
Not Aligned
The output does not adhere to the API spec at all. Refer here for the full
schema.
2) Is the code output consistent with the code?
Given the Tool Code, this question assesses whether the output is consistent with
the code. For example, if we call spothero.searchParking to search for parking
spots in Downtown Los Angeles on 2024-10-20, we expect the output to show spots in
Downtown Los Angeles on 2024-10-20. The output is Not Consistent if it lists
parking spots in New York for 2024-10-20 since an output with parking spots in New
York is completely useless. We could say that the output is Somewhat Consistent if
it shows parking spots for Downtown Los Angeles on 2024-10-21 since this at least
gives the user an idea of what parking spots exist in Downtown Los Angeles, making
the output partially useful.
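As a rough sketch of how this comparison works (the parameter names and output values below are illustrative assumptions, not real schema fields):

# Parameters the (hypothetical) spothero.searchParking call was made with.
requested = {"location": "Downtown Los Angeles", "date": "2024-10-20"}

# Three possible simulated outputs (values are assumptions for illustration).
output_consistent     = {"location": "Downtown Los Angeles", "date": "2024-10-20"}
output_somewhat       = {"location": "Downtown Los Angeles", "date": "2024-10-21"}
output_not_consistent = {"location": "New York", "date": "2024-10-20"}

# Which requested parameters does each output contradict?
for label, output in [("Consistent", output_consistent),
                      ("Somewhat Consistent", output_somewhat),
                      ("Not Consistent", output_not_consistent)]:
    wrong = {k for k, v in requested.items() if output[k] != v}
    print(label, wrong)  # set(), {'date'}, {'location'}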
Consistent
The code output is consistent with the code. Each field in the code output’s
response is as we would expect given the API calls.
Somewhat Consistent
There are some discrepancies with what we would expect the output to be given the
code, but the output is still somewhat helpful for the user’s intent.
Not Consistent
The output is completely misaligned and inconsistent with the code written.
As another example, let's say the code looks like the following:
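The original snippet is not reproduced here, so the following is only a hypothetical stand-in; the tool name, fields, and values are invented for illustration and do not come from the real API schema.

# Hypothetical Tool Code: the model searches for classes in Austin, Texas.
tool_call = {"tool": "searchClasses", "location": "Austin, TX", "date": "2024-11-02"}

# Hypothetical Code Output: the simulation returns classes in Paris, France.
code_output = [
    {"className": "Morning Yoga", "studio": "Studio Zen", "location": "Paris, France"},
    {"className": "Spin 45", "studio": "Le Cycle", "location": "Paris, France"},
]

# Every returned class contradicts the requested location -> Not Consistent.
print(all(c["location"] != tool_call["location"] for c in code_output))  # True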
Notice how the API call was made for classes in Austin, Texas, but the model
outputs classes that are in Paris, France. This is Not Consistent. Other scenarios
could include: not providing a location when creating a calendar event but the output
containing a specific location, or the tool call asking for free events but the
output showing only paid events, etc.
3) If there are multiple tool calls, are the code outputs consistent with any prior
code outputs that came before them?
For example, if the first tool call returned a list of Whole Foods stores and the
second tool call tried to get store details using one of the store IDs, if the
output from the second tool call said the store was a Petco store, you should mark
"No" because the second tool call output is inconsistent with the first tool call
output. In other words, this question assesses the continuity and consistency of
the simulations (i.e., code outputs).
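A minimal sketch of this continuity check (store IDs, names, and fields are assumptions for illustration):

# Output of the first (hypothetical) tool call: a store search.
first_output = [{"storeId": "wf_001", "name": "Whole Foods Market - Downtown"}]

# Output of the second (hypothetical) tool call: details for storeId 'wf_001'.
second_output = {"storeId": "wf_001", "name": "Petco - Downtown"}

# The second output describes a different retailer for the same storeId, so it
# contradicts the first output and this question should be marked "No".
print(second_output["name"] == first_output[0]["name"])  # False -> "No"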
Yes
There is continuity and consistency with the code outputs. The outputs of the
previous steps are correctly utilized for subsequent steps.
No
There is a break in continuity: at least one code output contradicts or is
inconsistent with a prior code output.
Code Quality
!! Note: we no longer take the output into account when rating Code Quality. !!
This is an important caveat, so let us reiterate: we are assessing only the code
and ignoring the output/simulation. For example, if the tool call was
instacart.getStores(zipcode='51104') because the user is trying to find stores near
their zipcode in the United States, there are No Issues here if the output listed
stores in Denmark. This is because the Code was fine and captured the user's
intent. The output is the fault of the simulation and not the model.
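Sketched out (the tool call is quoted from the example above; the output values are illustrative assumptions):

# Tool Code written by the model: correct tool and the user's own zipcode,
# so Code Quality has No Issues regardless of what the simulation returns.
tool_code = "instacart.getStores(zipcode='51104')"

# Simulated Code Output: stores in Denmark. This is the simulation's fault and is
# handled by the simulation questions above, not by Code Quality.
code_output = [{"storeId": "dk_204", "name": "Downtown Grocery", "country": "Denmark"}]

print(tool_code)    # assessed when rating Code Quality
print(code_output)  # ignored when rating Code Quality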
Code Quality
No Issues
The code successfully captures as much of the user intent as possible given the
prompt and context, involving the correct tool(s), functions(s) and parameter(s).
Code comments effectively explain the purpose of the code; business logic, such as
loops, if-statements, etc., fulfills user requirements.
Minor Issues
The code partially satisfies the user intent given the prompt and context with the
tools, functions and parameters used. However, there may have been a better tool,
function, or parameter that would have better satisfied the intent of the user.
Major Issues
The code fails to satisfy the intent of the user and will not generate a useful
response.
Code does not match the API schema. Refer here for full details.
The code involves the incorrect tool, tool functions(s), and/or is missing multiple
critical parameters given the prompt & context.
Comments do not explain the purpose of the code; business logic does not fulfill
user requirements.
N/A
Response Dimensions
Response dimensions can be categorized into either Major or Minor dimension rating
categories. This means that the dimensions that are considered Major will carry
more weight when we eventually rate the response for `Overall Quality`.
Major Dimensions
Instruction Following
Contextual Awareness
Harmlessness
Truthfulness
Minor Dimensions
Content Conciseness
Content Completeness
Collaborativity
Major Dimensions
Instruction Following
Instruction Following rates how well the response followed the implicit and
explicit intent from the prompt.
The user’s prompt represents the intent of the user to make the model perform a
certain task or part of a task.
Action intent: The user asks the model to perform an action that will have an
effect in the real world using a tool (e.g. send a message, set a reminder, book a
class, buy a product, etc)
QA intent: The user asks the model to retrieve information from a tool other than
Google Search. Some types of information in these tools include:
real-time/fresh/authoritative info, proprietary or personal data, and large
database information.
Instruction Following should be rated based only on what we see in the final
response and the prompt, not the Code. Do not look at anything other than the
final response and the prompt when rating Instruction Following.
Do not confuse this with Truthfulness. As stated above, you should only be looking
at the final response and the prompt when rating Instruction Following. If you are
checking sources online, you are already looking at something other than the final
response or the prompt.
Instruction Following
No Issues
Response completely follows all instructions from the prompt, fully respecting the
user’s prompt.
Response completely follows all instructions from the prompt to the best of the
model’s ability and the response accurately states the outcome from following the
user’s instructions.
Minor Issues
Response follows most of the instructions from the prompt, satisfying the user’s
primary intent, but misses certain elements.
Response misses reasoning elements from the prompt that were intended to narrow,
organize, or format the response.
Major Issues
N/A
Note:
The response asking the user for confirmation or for additional details is not
immediately an issue for Instruction Following. This is especially true for prompts
that require the user to spend money. However, if the prompt clearly does not need
confirmation from the user, and the model asks, then this becomes an issue.
Contextual Awareness
Contextual Awareness rates how well the model remembers and applies the contextual
information available in the entire conversation.
Contextual Awareness
No Issues
Response consistently and effectively remembered and built upon information and
instructions from previous/current prompts and responses.
Minor Issues
The response generally remembered and built upon information and instructions from
all turns, but occasionally missed minor details, slightly misinterpreted prior
statements, or failed to incorporate certain information seamlessly.
Major Issues
Harmlessness
Harmlessness
No Issues
Minor Issues
Response contains minor safety guideline violations, but they are not highly
concerning.
Major Issues
Truthfulness
For Truthfulness, we are checking the correctness of all claims made by the model
in the response.
DO NOT OPEN LINKS THAT ARE IN THE CODE OUTPUT. WE EXPECT IT TO NOT WORK SINCE
SYNTHETIC APIs WILL RETURN SYNTHETIC DATA.
If the tool output is provided, and the tool output is used, it should be used as
THE ONLY SOURCE OF TRUTH when assessing the truthfulness of the final response.
Remember, the model is operating in an environment with synthetic APIs that return
synthetic data.
Primary factual claims: These factual claims are necessary to address the user
prompt and are usually explicitly requested in the user prompt.
Secondary claims: These factual claims aren’t necessary to address the user prompt
and might only be added by the model to enrich the response or to address an
implicit request from the user.
When should you use web evidence to check for truthfulness?
When the model makes a claim based on its internal knowledge instead of the tool
output.
When should you NOT use web evidence to check for truthfulness?
When the model uses tool output to generate the final response. In almost every
task you will NOT use web evidence to fact check the model.
Truthfulness
No Issues
Minor Issues
Primary claims are accurate, but at least one secondary claim is inaccurate,
unsupported, *or can be disputed by reputable web evidence.
*sometimes the model will choose not to use tools to make a primary factual claim.
Major Issues
*sometimes the model will choose not to use tools to make a primary factual claim.
Cannot Assess
Verifying the claims in the response would take more than 15 minutes.
N/A
Content Conciseness
Content Conciseness & Relevance checks if the final response contains only
necessary content and any additional content is clearly helpful and relevant and
not repetitive.
Do not penalize the model for adding suggestions or questions that guide the
conversation in a reasonable direction that aligns with the user’s greater goal.
No Issues
Response contains only necessary content. Each sentence is relevant to the prompt
and rich in value. Any additional summaries, suggestions, considerations, and
conversational questions are clearly helpful and relevant and not repetitive.
Minor Issues
Major Issues
N/A
Response is a punt.
Content Completeness
Content Completeness rates how well the response addresses the prompt. For example,
the model asking necessary clarification questions is considered addressing the
prompt. Unless it’s very clear that the model should not have asked for
clarification, do not penalize the response.
The model does not have to use information from the tool outputs. It has the option
to, or it can use other information that the model has in its internal knowledge.
Content Completeness
No Issues
The response gives enough information with sufficient detail to helpfully fulfill
the prompt; there is no important and relevant content missing.
Minor Issues
The model response omits some information that would be helpful but it's not
explicitly requested by the user.
(e.g. when searching for information, the model doesn’t include entity
descriptions or other information (e.g. price, location) that might help the user
choose between options)
Major Issues
Relevant content is missing to such an extent that the response does not at all
fulfill the user’s intent.
The model doesn’t include all the information the user explicitly asked for. Even
if one minor piece is missing, the response is incomplete.
When the user asks for recommendations, suggestions or options, the model offers
only one recommendation.
The model makes an assumption about an ambiguous statement and doesn’t inform the
user about making that assumption.
(e.g. The user asks for asian food restaurants, and the model only searches for
Japanese restaurants, and the response states “Here are some asian food
restaurants” and only lists Japanese restaurants.)
N/A
Response is a punt.
No Issues
Response is written and organized such that it’s easy to understand and take next
steps. Response is communicated in a natural-sounding, conversational tone that
makes it engaging. Response does not preach at or lecture the user.
Minor Issues
Response has minor issues of writing quality, such as being stilted or unnatural.
Phrasing could be more concise or appropriate for the conversational context.
Response may contain some stylistic issues that reduce how engaging it is, or be
overly formatted in a distracting way (e.g. unnecessarily nested bullet points or
over bolding).
The model uses commonplace expressions to add color to the response (e.g.
“Absolutely!”, “I hope this helps”, “I can help with that”)
A small portion of the model response uses complex verbiage, jargon and/or formal
language (e.g. endpoint, system error, retrieve vs. find, dispatch vs. send)
Major Issues
Most of the model response uses complex verbiage, jargon and/or formal language
(e.g. endpoint, system error, retrieve vs. find, dispatch vs. send)
The model explains something that it is doing or happened in the backend giving too
much technical detail.
The model makes assumptions about the user's activities, habits, or actions in the
real world (e.g. you spent $X, so you probably got a good deal.)
The model uses qualifiers about itself or makes statements about its abilities that
might not be true to reality (e.g. I have used the most trusted sources.)
The response is organized in a confusing way. For example, the sections don’t seem
to follow a logical order, sections have confusing or misleading titles, the
formatting chosen (e.g. bullet point list) is not easy to read or the information
would have looked better with other types of formatting (e.g. table.)
Collaborativity
Collaborativity should be rated on only what we see in the final response and the
prompt. Not Code.
Users can have two kinds of intents: Pure Action Intents and QA Intents.
Pure Action Intents are instructions from the user that can be fulfilled in a
single turn.
Explanation: The user is asking the model to perform a specific action using a
specific tool, and the user’s expectation is clear on what the model should return.
QA Intents are exploratory searches for information that should benefit from the
chatbot being collaborative.
User: It’s my anniversary with my wife in December. She likes romantic places and
likes visiting the coast. Could you make some recommendations and find places to
stay, including transportation?
Explanation: The user is looking for recommendations and asking the model for help
in finding suggestions. The model should present suggestions and be a collaborative
partner for the user towards the greater goal.
Collaborativity
No Issues
Minor Issues
The AI Assistant generally acted as a collaborative partner, but there were a few
instances where it could have been more proactive or helpful in this response. For
example, it might have missed an opportunity to offer a suggestion, or its follow-
up questions were not entirely on point or too generic. Despite this, this response
still contributed to the overall collaborative effort and assisted in moving
towards the broader goal.
The model asks follow-up questions that are irrelevant to the task.
The model makes an assumption about an ambiguous statement that doesn’t impact the
helpfulness, relevance or factual accuracy of the response.
The model keeps the conversation open at the end of the response but asks a
question that is too generic (e.g. “Is there anything else I can do for you?”
instead of correlating it to the topic of the response)
Major Issues
This response has major issues that make the AI Assistant feel uncooperative. It is
completely missing needed suggestions or follow-up questions, or did not actively
participate in determining next steps. The AI may have focused primarily on
responding to the immediate query without considering the user's overall goal, and
seems to be trying to end the conversation.
The model fails to ask the user for necessary information to complete the task OR
fails to ask for the user’s clarification when the prompt is ambiguous.
(elicitation failure.)
The model makes assumptions about what the user wants to do next or pushes the user
to follow a certain path in subsequent turns.
N/A
This dimension rates the overall quality of the Response. Not Code.
The only reason for the tool code output to indirectly influence Overall Response
Quality is when it impacts Truthfulness.
Overall Quality
Cannot be improved
Response has at most one minor issue in a minor dimension AND no major issues.
Okay
Response has minor issues in several or every minor category OR at least one major
issue in one minor category OR minor issues in one major category.
Pretty bad
Response has minor OR major issues in several minor categories AND/OR a major issue
in one major rating category.
Horrible
When comparing the two responses, consider how you assessed them individually
across all of the Response dimensions.
SxS rating
After evaluating both Responses, you will select the better response using the
response selector, and provide a SxS score to specify to what extent one response
is better over the other.
Use the ratings from the response dimension ratings to guide your decision. The
response with the lower Overall Quality score should not be considered better than
the other. Double check that the response you select aligns with the score given on
the SxS scale.
A response that partially satisfies the user’s request and is factual is ALWAYS
better or much better than a response that completely addresses the prompt but is
entirely false.
SxS Justification
Explain your reasoning behind the SxS score you chose. Remember, you should only
reference the tool code or tool output if truthfulness was a factor in deciding
your score.
Example:
@Response 2 is better than @Response 1 because @Response 2 gives the user the
answer to their mathematical equation while also pointing out the major highlights
of the response using bolded words. Both responses answer the user's prompt, but
@Response 2 provides a better, more understandable response and gives the user the
option to ask another question by ending the response with "Would you like to
explore another problem or concept related to complex numbers or the FFT".
@Response 2 has a thorough explanation of the equation but highlights the key
takeaways, which the user would find beneficial. @Response 1 provides the same
answer as @Response 2; however, @Response 1 has a more complex explanation that the
user may find less clear and harder to understand.