
You might see in the tasks that the ratings for Response 1 have already been made.

The response has been prefilled with credible ratings, but please check and make
sure the justifications are clear and robust. If you feel that the
ratings/justifications are incorrect, please correct them.
Note: disregard the time in the descriptive context when rating the tasks. Also, do not use our current reference in time (i.e., 2024) when rating the tasks. For example, if google_search is called with query="World Series Schedule" and results from 2023 show up, this is fine. The model is being run in a simulated environment.

DO NOT RUN THE CODE. THE OUTPUT OF THE CODE IS ALREADY AVAILABLE.

Update Oct 20:

Model and Simulation -> click here


Simulation Quality -> click here
Updated Code Quality Guidance -> click here
Gratitude Corsage - SxS Eval - Instructions

Project Resources

Task Goals

Task Overview

Competitor Response

Task Workflow

Prompt Analysis Guide

Code and Response Analysis Guide

How to check for hallucinations in Code

Dimensions

Code Dimension

Code Quality

Response Dimension

Major Dimensions

Instruction Following

Contextual Awareness

Harmlessness

Truthfulness

Minor Dimensions

Content Conciseness & Relevance


Content Completeness

Writing Style & Tone

Collaborativity

Overall Response Quality

Preference Ranking (SxS)

SxS rating

SxS Justification

Project Resources
Tool Schema Docs
Task Goals
Understand the user’s needs based on the conversation.
Evaluate the quality of the Code for each model.
Evaluate the quality of the Response for each model.
Identify the better Response and justify why.
Task Overview
Each task will include a conversation between the user and the model, along with two sets of Code and Code Output and two Responses, which you will evaluate and compare.

The tools used by the model in this project are what we call Synthetic APIs. This means that the outputs from the Synthetic APIs should be taken as the source of truth, and you should not attempt to fact-check the outputs from the Synthetic APIs. Instead, you will check for issues in the simulation.

The user’s request needs to be determined from the entire conversation with the
most recent request as the main focus. Based on your interpretation of the user’s
request, you will evaluate each Code, Code Output and Response along specific
dimensions. It’s important to note that your evaluation and ratings of the Code
and Code Output section should not affect your evaluation of the Responses and vice
versa.

Competitor Response

In some tasks you will see the final response will look something like this

You will notice a lot of differences in the Code of the competitor response when
compared to our normal response. For example, all of the code steps are grouped
together, and then the code outputs, this is intended. You will also notice that
sometime that the Code in the competitor response will not have the full API tool
executions and it’s unclear what we should compare it to in the API schema. This is
also fine. This is because the model that generates the competitor response does
not call on the same tools. This means, we do not check the competitor response
Code against the API schema. Instead we want to look for:
Is the code throwing any errors?
Are there any parameter values being used that don't align with the prompt?
Is what the code doing makes sense with respect to the prompt?
Does the output of the code look like it would be useful in fulfilling the prompt?

Essentially, we’re checking if it looks reasonable. Just like how we rate the Code
of the normal response, we can rate the Code of the competitor response as having
issues depending on the severity of the issue.
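For illustration, a competitor-style Code block might look like the minimal sketch below. The tool names and parameters here are hypothetical assumptions for this example only, not from any real API schema:

# All code steps are grouped together, as is typical in competitor responses
flights = flight_search.findFlights(origin="San Francisco", destination="Miami", depart_before="8AM")
hotels = hotel_search.findHotels(city="Miami", stars=5)
print(flights)
print(hotels)

Here we would check that no errors are thrown, that the parameter values align with the prompt (e.g. destination="Miami", not "New York"), that what the code is doing makes sense for the request, and that the output looks useful for fulfilling it.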

Task Workflow
Read the conversation and the final prompt. Interpret the user’s expectations from
the model by putting yourself in the shoes of the user.
Prompt Analysis Guide
Analyze the Simulation, Code, and Responses.
Simulation, Code, and Response Analysis Guide

Provide a preference ranking between the responses.

Response Analysis Guide

Prompt Analysis Guide


Carefully analyzing the conversation between the model and the user is imperative
in figuring out the user’s needs. We need to put ourselves in the shoes of the user
to understand what the user is expecting from the model responses. Some tasks will
have conversation history that is relevant to the overall task, and some will not.
Some tasks will only have the final request from the user.

Example #1 / Previous conversation that is relevant to the entire task.

User: I’m planning a trip to New York this weekend, can you help me find direct flights that depart from San Francisco before 8AM?

Model: Sure! Here are some flights to New York departing before 8AM this weekend!

…list of flights…

User: I want to go to Miami instead.

The user is updating the flight destination with their last request. There are no
other instructions given to the model regarding the flight departure date or the
time. The model needs to carry over the “weekend” and “8AM” requirements from the
previous conversation history and find flights from San Francisco to Miami that
depart this weekend before 8AM.

Example #2 / Previous conversation that is not relevant to the task.

User: Can you show me a list of 5 star hotels that are in New York?

Model: Here is a list of 5 star hotels in New York

…list of hotels…
User: Give me a list of 5 dinosaurs ranked by their size.

The user is switching directions and is requesting information about dinosaurs.


Although the user requested hotel information in the previous turn, this has no
relevance to the user’s most recent request. The model needs to disregard the
previous conversation history when it is irrelevant.

Example #3 / No previous conversation, just the final request.

User: Find me YouTube videos about the industrial revolution.

There is no previous conversation and the model only needs the final request to
fulfill the prompt, which is to find YouTube videos.

The goal is to analyze the conversation to identify user constraints and user
expectations. This step is extremely important since you will be assessing the
Code, Code Output and Responses based on your interpretation of the prompt. It’s
helpful to keep notes for tasks that have many constraints and/or requests. Try to
visualize the ideal execution steps the model should take and the ideal response
that would be fulfilling to the user.

The Model (Code) and Simulation (Output)

The components of this project's task can be divided into two main categories: the
model and the simulation.

The Model is in charge of taking action, which entails writing the Tool Code and
generating the Response. On the other hand, the Code Output that comes after the
Tool Code is what we call The Simulation since it's a simulation of what might be
returned from the API. Again, this distinction is important: the model is in charge
of any action taken (i.e., writing the Tool Code and generating a Response),
whereas the Code Output of the tool calls is part of the simulation and not the
model's responsibility.

**When rating the competitor response, since the first simulation question is answered with respect to the API schema, you should mark that it is aligned. This is because we do not check the competitor code against the API schema.**

Dimensions

Each dimension listed below describes a very specific aspect of the simulation/output, code, and/or response.

This means that each dimension justification requires very specific reasoning on
why this specific dimension was rated to have issues.

This also means that you should follow the rubrics below and not add your own interpretation that is not listed in the rubrics. DO NOT bring rules from different projects and apply them here.
If you have any doubts while rating a dimension, just come back to the instructions
and carefully read what the instructions say about that specific dimension. If you
still have doubts after, ask for help in the project channels.

Remember, when you are rating each dimension, only think about the dimension you
are currently rating. We are rating the tasks following the customer’s
specifications.

Simulation (Code Output) Quality

1) Do the outputs of the simulated API calls align with what is expected from the
API spec?

This question asks whether the Code Output aligns with the response schema in the
API spec. For example, if we call spothero.searchParking with valid parameters, we
would expect the API to return a list of ParkingSpots based on the API Schema.
Furthermore, the ParkingSpots returned should have only the fields listed in the
schema: it would be Somewhat Aligned to have an extra field named phoneNumber and
Not Aligned if none of the fields adhered to the schema. To illustrate another
example, the output is Not Aligned with the API Schema if searchParking returns a
single ParkingSpot rather than an array of ParkingSpots.
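As a minimal hypothetical sketch (the tool call and fields below are assumptions based on the example above, not the actual schema):

print(spothero.searchParking(location="Downtown Los Angeles", date="2024-10-20"))

# Completely Aligned: an array of ParkingSpots containing only schema fields, e.g.
# [spothero.ParkingSpot(id='ps_11', location='Downtown Los Angeles', price=18), ...]
# Somewhat Aligned: the same array, but each spot carries an extra field such as
# phoneNumber that the schema does not define.
# Not Aligned: a single ParkingSpot instead of an array, or fields that do not
# match the schema at all.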

Completely Aligned

There are no discrepancies between what we see in the code output and what we
expect according to the API Schema.

Somewhat Aligned

There is a slight discrepancy with the API spec that doesn’t hinder the overall
usefulness of the output.

Not Aligned

The output does not adhere to the API spec at all. See here for the full schema.

No Code or Code Output

Either no code written or no code output even if code is written.

2) Are the code outputs consistent with the code written?

Given the Tool Code, this question assesses whether the output is consistent with
the code. For example, if we call spothero.searchParking to search for parking
spots in Downtown Los Angeles on 2024-10-20, we expect the output to show spots in
Downtown Los Angeles on 2024-10-20. The output is Not Consistent if it lists
parking spots in New York for 2024-10-20 since an output with parking spots in New
York is completely useless. We could say that the output is Somewhat Consistent if
it shows parking spots for Downtown Los Angeles on 2024-10-21 since this at least
gives the user an idea of what parking spots exist in Downtown Los Angeles, making
the output partially useful.
Consistent

The code output is consistent with the code. Each field in the code output’s
response is as we would expect given the API calls.

Somewhat Consistent

There are some discrepancies with what we would expect the output to be given the
code, but the output is still somewhat helpful for the user’s intent.

Not Consistent

The output is completely misaligned and inconsistent with the code written.

No Code or Code Output

Either no code written or no code output even if code is written.

As another example, let's say the code looks like the following:

# Find pottery classes in Austin, Texas using the find_a_class.searchClasses

print(find_a_class.searchClasses(location = "Austin, Texas", subject = "Pottery"))

The output looks like this:

find_a_class.Class(id='8b79f8a3e7', location='Paris,France', price=300, subject='Advanced Cooking', teacher='Johnathan Peterson'),

Notice how the API call was made for classes in Austin, Texas, but the output contains classes in Paris, France. This is Not Consistent. Other scenarios could include not adding a location when creating calendar events but the output containing a specific location, the tool call asking for free events but the output showing only paid events, etc.

3) If there are multiple tool calls, are code outputs consistent with any prior
code output that came before it?

For example, if the first tool call returned a list of Whole Foods stores and the
second tool call tried to get store details using one of the store IDs, if the
output from the second tool call said the store was a Petco store, you should mark
"No" because the second tool call output is inconsistent with the first tool call
output. In other words, this question assesses the continuity and consistency of
the simulations (i.e., code outputs).
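As a minimal hypothetical sketch of this continuity check (tool names, methods, and IDs are made up for illustration):

# First tool call returns a list of Whole Foods stores
print(wholefoods.searchStores(city="Seattle"))
# Output: [wholefoods.Store(id='wf_201', name='Whole Foods Market - Westlake'), ...]

# Second tool call uses a store ID taken from the first output
print(wholefoods.getStoreDetails(store_id='wf_201'))
# Output: wholefoods.Store(id='wf_201', name='Petco - Westlake')
# This contradicts the first output, so this would be rated "No".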

Yes or Only One Code Output

There is continuity and consistency with the code outputs. The outputs of the
previous steps are correctly utilized for subsequent steps.

There is only one Code Output.

No

There is no consistency and continuity with the outputs.

No Code or Code Output

Either no code written or no code output even if code is written.

Code Quality
!! Note: we no longer take the output into account when rating Code Quality. !!

Code Quality is rated based on the following: tool(s), method(s), parameter(s), comment(s), and business logic. The Code Quality question assesses the model and NOT the simulation. In other words, the Code Quality section is assessing how the code was written and NOT the output. Please look at only the Code and assess ONLY the code. Ignore any Code Output from prior questions.

This is an important caveat, so let us reiterate: we are assessing only the code
and ignoring the output/simulation. For example, if the tool call was
instacart.getStores(zipcode='51104') because the user is trying to find stores near
their zipcode in the United States, there are No Issues here if the output listed
stores in Denmark. This is because the Code was fine and captured the user's
intent. The output is the fault of the simulation and not the model.
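To make this concrete, a minimal hypothetical sketch of that example:

# The user wants stores near their US zipcode
print(instacart.getStores(zipcode='51104'))
# Simulated output: a list of stores located in Denmark

The code itself captures the user's intent with the correct parameter, so Code Quality has No Issues; the Denmark output is a simulation problem and would be penalized under the Simulation (Code Output) Quality questions instead.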

Code Quality

No Issues

The code successfully captures as much of the user intent as possible given the prompt and context, involving the correct tool(s), function(s), and parameter(s).

Code comments effectively explain the purpose of the code; business logic, such as
loops, if-statements, etc., fulfills user requirements.

Minor Issues

The code partially satisfies the user intent given the prompt and context with the
tools, functions and parameters used. However, there may have been a better tool,
function, or parameter that would have better satisfied the intent of the user.

This code partially satisfies the prompt, and it has missing/unnecessary tool/function/parameters.

Code contains redundant API calls.


Comments may not fully explain the purpose of the code.

Major Issues

The code fails to satisfy the intent of the user and will not generate a useful
response.

Code does not match the API schema. See here for full details.

The code involves the incorrect tool, tool function(s), and/or is missing multiple critical parameters given the prompt & context.

Comments do not explain the purpose of the code; business logic does not fulfill
user requirements.

N/A

Empty or skeleton JSON “[ ]” in the code section.

Response Dimensions

Response dimensions can be categorized into either Major or Minor dimension rating
categories. This means that the dimensions that are considered Major will carry
more weight when we eventually rate the response for `Overall Quality`.

Major dimensions

Instruction Following

Contextual Awareness

Harmlessness

Truthfulness

Minor Dimensions

Content Conciseness & Relevance

Content Completeness

Writing Style & Tone

Collaborativity

Major Dimensions

Instruction Following

Instruction Following rates how well the response followed the implicit and
explicit intent from the prompt.
The user’s prompt represents the intent of the user to make the model perform a
certain task or part of a task.

There are two types of intent:

Action intent: The user asks the model to perform an action that will have an
effect in the real world using a tool (e.g. send a message, set a reminder, book a
class, buy a product, etc)
QA intent: The user asks the model to retrieve information from a tool other than
Google Search. Some types of information in these tools include:
real-time/fresh/authoritative info, proprietary or personal data, and large
database information.

Instruction Following should be rated based on only what we see in the final
response and the prompt. Not Code. Only look at the final response when rating
Instruction Following. Do not look at anything other than the final response when
rating Instruction Following.

Do not confuse this with Truthfulness. As stated above, you should only be looking
at the final response and the prompt when rating Instruction Following. If you are
checking sources online, you are already looking at something other than the final
response or the prompt.

Instruction Following

No Issues

Response completely follows all instructions from the prompt, fully respecting the
user’s prompt.

Response completely follows all instructions from the prompt to the best of the
model’s ability and the response accurately states the outcome from following the
user’s instructions.

Minor Issues

Response follows most of the instructions from the prompt, satisfying the user’s
primary intent, but misses certain elements.

Response ignores user’s implicit intents.

Response misses reasoning elements from the prompt that were intended to narrow, organize, or format the response.

(e.g. “best” restaurants, “cheapest” dresses, etc)

Response only fulfills part of the prompt.

Major Issues

The response ignores, circumvents, or violates key components of the prompt, rendering the response unhelpful to the user.

The response punts unnecessarily on a non-safety related issue.


Final response is missing

Final response is code

N/A

There are no explicit or implicit instructions to follow in the prompt.

(e.g. a prompt like “I like clouds”)

Model ignores or circumvents instructions for safety reasons.

(e.g. refusing to answer “how do I make a pipe bomb”).

Note:

The response asking the user for confirmation or for additional details is not
immediately an issue for Instruction Following. This is especially true for prompts
that require the user to spend money. However, if the prompt clearly does not need
confirmation from the user, and the model asks, then this becomes an issue.

Contextual Awareness

Contextual Awareness rates how well the model remembers and applies the contextual
information available in the entire conversation.

Contextual Awareness

No Issues

Response consistently and effectively remembered and built upon information and
instructions from previous/current prompts and responses.

Response demonstrates a clear understanding of the ongoing context, referencing and incorporating details from earlier exchanges to provide coherent, relevant, and personalized responses.

Minor Issues

The response generally remembered and built upon information and instructions from
all turns, but occasionally missed minor details, slightly misinterpreted prior
statements, or failed to incorporate certain information seamlessly.

Model slightly misinterprets what the user said.

Model ignores or hallucinates contextual information from the entire conversation that is not essential to fulfill the primary intent of the user.

Major Issues

Model grossly misinterprets something the user said.

Model makes an assumption about an ambiguous statement, rendering the response unhelpful, irrelevant, or factually inaccurate.
Current response contradicts the responses from previous turns.

Model ignores or hallucinates contextual information from the entire conversation that is essential to fulfill the primary intent of the user.

Final response is missing.

Harmlessness

Harmlessness

No Issues

Response has no safety guideline violations.

Minor Issues

Response contains minor safety guideline violations, but they are not highly
concerning.

Major Issues

Response contains significant safety guideline violations.

Final response is missing.

Truthfulness

For Truthfulness, we are checking the correctness of all claims made by the model
in the response.

Do not confuse this with Instruction Following.

DO NOT OPEN LINKS THAT ARE IN THE CODE OUTPUT. WE EXPECT IT TO NOT WORK SINCE
SYNTHETIC APIs WILL RETURN SYNTHETIC DATA.

If the tool output is provided, and the tool output is used, it should be used as
THE ONLY SOURCE OF TRUTH when assessing the truthfulness of the final response.

Remember, the model is operating in an environment with synthetic APIs that return
synthetic data.

Note the difference between:

Primary factual claims: These factual claims are necessary to address the user
prompt and are usually explicitly requested in the user prompt.
Secondary claims: These factual claims aren’t necessary to address the user prompt
and might only be added by the model to enrich the response or to address an
implicit request from the user.

When should you use web evidence to check for truthfulness?

When the model makes a claim based on its internal knowledge instead of the tool
output.

When should you NOT use web evidence to check for truthfulness?

When the model uses tool output to generate the final response. In almost every
task you will NOT use web evidence to fact check the model.

Truthfulness

No Issues

All claims are accurate.

Minor Issues

Primary claims are accurate, but at least one secondary claim is inaccurate,
unsupported, *or can be disputed by reputable web evidence.

*sometimes the model will choose not to use tools to make a primary factual claim.

Major Issues

At least one primary claim is inaccurate, unsupported, *or disputed according to reputable web evidence.

*sometimes the model will choose not to use tools to make a primary factual claim.

Response contains math miscalculations or wrong measuring units.

Response contains unverifiable factual claims.

Incorrect categorization of data.

(e.g. mistaking a hotel star rating with a user rating)

Final response is missing.

Cannot Assess

Verifying the claims in the response would take more than 15 minutes.

Response is a full punt.

N/A

The response does not make any factual claims.

(e.g. creative tasks such as writing fictional stories or poems)


Minor Dimensions

Content Conciseness & Relevance

Content Conciseness & Relevance checks if the final response contains only
necessary content and any additional content is clearly helpful and relevant and
not repetitive.

Do not penalize the model for adding suggestions or questions that guide the
conversation in a reasonable direction that aligns with the user’s greater goal.

Content Conciseness & Relevance

No Issues

Response contains only necessary content. Each sentence is relevant to the prompt
and rich in value. Any additional summaries, suggestions, considerations, and
conversational questions are clearly helpful and relevant and not repetitive.

Minor Issues

Response is generally relevant to the prompt but contains a small portion of unnecessary content that is repetitive, unhelpful, or irrelevant.

Major Issues

Response contains a significant amount of unnecessary content that is repetitive, unhelpful, or irrelevant.

The response punts unnecessarily on a non-safety related issue.

Final response is missing.

N/A

Response is a punt.

Final response is code

Content Completeness

Content Completeness rates how well the response addresses the prompt. For example,
the model asking necessary clarification questions is considered addressing the
prompt. Unless it’s very clear that the model should not have asked for
clarification, do not penalize the response.

The model does not have to use information from the tool outputs. It has the option
to, or it can use other information that the model has in its internal knowledge.

Content Completeness
No Issues

The response gives enough information with sufficient detail to helpfully fulfill
the prompt; there is no important and relevant content missing.

Minor Issues

The model response omits some information that would be helpful but it's not
explicitly requested by the user.

(e.g. when searching for information, the model doesn’t include entity
descriptions or other information (e.g. price, location) that might help the user
choose between options)

Without being prompted, the model offers recommendations, suggestions, or options, and the model provides a very limited amount (fewer than 2).

Major Issues

Relevant content is missing to such an extent that the response does not at all
fulfill the user’s intent.

The model doesn’t include all the information the user explicitly asked for. Even if one minor piece is missing, the response is incomplete.

When the user asks for recommendations, suggestions or options, the model offers
only one recommendation.

The model makes an assumption about an ambiguous statement and doesn’t inform the
user about making that assumption.

(e.g. The user asks for asian food restaurants, and the model only searches for
Japanese restaurants, and the response states “Here are some asian food
restaurants” and only lists Japanese restaurants.)

The response information is completely irrelevant and unhelpful.

Final response is missing.

Final response is code

N/A

Response is a punt.

Writing Style & Tone

For this dimension, a conversational, informal tone is favored, similar to what a close coworker would use in an informal conversation about non-work topics. Casual, friendly, and even a little playful are positive traits if appropriate with the context of the user prompt, even more so if explicitly requested in the prompt.

Writing Style & Tone

No Issues

Response is written and organized such that it’s easy to understand and take next
steps. Response is communicated in a natural-sounding, conversational tone that
makes it engaging. Response does not preach at or lecture the user.

Minor Issues

Response has minor issues of writing quality, such as being stilted or unnatural.
Phrasing could be more concise or appropriate for the conversational context.

Response may contain some stylistic issues that reduce how engaging it is, or be
overly formatted in a distracting way (e.g. unnecessarily nested bullet points or
over bolding).

The model uses commonplace expressions to add color to the response (e.g.
“Absolutely!”, “I hope this helps”, “ I can help with that”)

A small portion of the model response uses complex verbiage, jargon and/or formal
language (e.g. endpoint, system error, retrieve vs. find, dispatch vs. send)

Major Issues

Response is stylistically unnatural, unengaging, or formatted poorly enough that it is difficult to read and understand. Or, the response preaches to or lectures the user.

Most of the model response uses complex verbiage, jargon and/or formal language
(e.g. endpoint, system error, retrieve vs. find, dispatch vs. send)

The response includes long-winded sentences, and/or complex grammatical structures (e.g. passive voice.)

The response sounds like marketing copy and/or endorses a product/service.

The response is overenthusiastic (excessive exclamation points.)

The model explains something that it is doing or happened in the backend giving too
much technical detail.

The model makes assumptions about the user's activities, habits, or actions in the
real world (e.g. you spent $X, so you probably got a good deal.)

The model uses qualifiers about itself or makes statements about its abilities that
might not be true to reality (e.g. I have used the most trusted sources.)

The response is organized in a confusing way. For example, the sections don’t seem
to follow a logical order, sections have confusing or misleading titles, the
formatting chosen (e.g. bullet point list) is not easy to read or the information
would have looked better with other types of formatting (e.g. table.)

The response organizes the information in multiple sections that could be better correlated by topic. For example, if the user asks for the best neighborhoods to stay in a certain city and hotel suggestions, the response offers two separate sections (neighborhoods, hotels) instead of combining the information, organizing the hotel suggestions by neighborhood.

Final response is missing.

Final response is code

Collaborativity

Collaborativity should be rated on only what we see in the final response and the
prompt. Not Code.

Users can have two kinds of intents: Pure Action Intents and QA Intents.

Pure Action Intents are instructions from the user that can be fulfilled in a
single turn.

User: Find me all of the workouts I did yesterday using @my_fitness_pal

Explanation: The user is asking the model to perform a specific action using a
specific tool, and the user’s expectation is clear on what the model should return.

QA Intents are exploratory searches for information that should benefit from the
chatbot being collaborative.

User: It’s my anniversary with my wife in December. She likes romantic places and likes visiting the coast. Could you make some recommendations and find places to stay, including transportation?

Explanation: The user is looking for recommendations and asking the model for help
in finding suggestions. The model should present suggestions and be a collaborative
partner for the user towards the greater goal.

Collaborativity

No Issues

The AI Assistant clearly and effectively acted as a collaborative partner in this response. It proactively offered relevant suggestions, asked insightful follow-up questions to clarify needs and goals, or actively worked with the user to determine the best next steps. The AI demonstrated a clear understanding of the user's broader objectives and shared the effort in achieving them; it’s not all on the user to continue the momentum of the conversation.

Minor Issues

The AI Assistant generally acted as a collaborative partner, but there were a few
instances where it could have been more proactive or helpful in this response. For
example, it might have missed an opportunity to offer a suggestion, or its follow-
up questions were not entirely on point or too generic. Despite this, this response
still contributed to the overall collaborative effort and assisted in moving
towards the broader goal.

The model asks follow-up questions that are irrelevant to the task.

The model makes an assumption about an ambiguous statement that doesn’t impact the
helpfulness, relevance or factual accuracy of the response.

The model keeps the conversation open at the end of the response but asks a
question that is too generic (e.g. “Is there anything else I can do for you?”
instead of correlating it to the topic of the response)

Major Issues

This response has major issues that make the AI Assistant feel uncooperative. It is
completely missing needed suggestions or follow-up questions, or did not actively
participate in determining next steps. The AI may have focused primarily on
responding to the immediate query without considering the user's overall goal, and
seems to be trying to end the conversation.

The model fails to ask the user for necessary information to complete the task OR
fails to ask for the user’s clarification when the prompt is ambiguous.
(elicitation failure.)

The model finishes the conversation abruptly.

The model makes assumptions about what the user wants to do next or pushes the user
to follow a certain path in subsequent turns.

Final response is missing.

N/A

Response is a full punt.

User’s goal can be fulfilled in a single turn.

Overall Response Quality

This dimension rates the overall quality of the Response. Not Code.

The only reason for the tool code output to indirectly influence Overall Response
Quality is when it impacts Truthfulness.

Overall Quality

Cannot be improved

The response is flawless and cannot be meaningfully improved.

There are no major or minor issues in any Response rating dimensions.


Minor room for improvement

Response has at most one minor issue in a minor dimension AND no major issues.

Okay

Response has minor issues in several or every minor category OR at least one major
issue in one minor category OR minor issues in one major category.

Pretty bad

Response has minor OR major issues in several minor categories AND/OR a major issue
in one major rating category.

Horrible

Response has major issues in more than one major dimension.

Final response is missing.

Final response is code

Preference Ranking (SxS)

When comparing the two responses, consider how you assessed them individually
across all of the Response dimensions.

Consider the various dimensions, but the rating/justification should be on the final response specifically. The only time there should be a reference to the tool code or tool outputs is if truthfulness in the final response is affected (incorrect grounding on tool calls/outputs).

SxS rating

After evaluating both Responses, you will select the better response using the
response selector, and provide a SxS score to specify to what extent one response
is better over the other.

Remember, this section is for comparing the two Responses.

Use the ratings from the response dimension ratings to guide your decision. The
response with the lower Overall Quality score should not be considered better than
the other. Double check that the response you select aligns with the score given on
the SxS scale.
A response that partially satisfies the user’s request and is factual is ALWAYS better or much better than a response that completely addresses the prompt but is entirely false.

If both responses are missing the final response, then the SxS score is 4.

SxS Justification

Explain your reasoning behind the SxS score you chose. Remember, you should only
reference the tool code or tool output if truthfulness was a factor in deciding
your score.

Example:

@Response 2 is better than @Response 1 because @Response 2 gives the user the
answer to their mathematical equation while also pointing out the major highlights
of the response using bolded words. Both responses answer the user's prompt, but
@Response 2 provides a better, more understandable response and gives the user the
option to ask another question by ending the response with "Would you like to
explore another problem or concept related to complex numbers or the FFT".
@Response 2 has a thorough explanation of the equation but highlights the key takeaways, which the user would find beneficial. @Response 1 provides the same answer as @Response 2; however, @Response 1 has a more complex explanation that the user may find less clear and harder to understand.
