@HuanzhiMao
Collaborator

@HuanzhiMao HuanzhiMao commented Mar 31, 2024

This PR is for the Leaderboard April 1 update.
This update adds new models (Claude-3-Haiku, Databricks-DBRX-Instruct), a more advanced AST evaluation process, and updated evaluation datasets. It also measures cost and latency statistics during evaluation. We have also released the manual that our evaluation is based on.

Does this affect leaderboard score?
Yes! Read the updated blog post (blog 8, leaderboard) to learn more.


Co-authored-by: Charlie Cheng-Jie Ji <charliechengjieji@berkeley.edu>
Co-authored-by: Fanjia Yan <fanjiayan@berkeley.edu>

@HuanzhiMao HuanzhiMao changed the title from "Leaderboard V2 release" to "Leaderboard Update April 1" on Apr 1, 2024
Owner

@ShishirPatil ShishirPatil left a comment

LGTM

@ShishirPatil ShishirPatil merged commit 6971033 into ShishirPatil:main Apr 1, 2024
@HuanzhiMao HuanzhiMao mentioned this pull request Apr 9, 2024
ShishirPatil pushed a commit that referenced this pull request Apr 11, 2024
This PR is for the leaderboard April 8th release:

1. Fixed an oversight introduced in #299. Some function-calling (FC)
models do not accept `float` as a parameter type in their input. For
those models, when a parameter's type is `float`, the evaluation
procedure now converts the type to `number` in the model input and
notes in the parameter description that `This is a float type value.`.
An additional field, `format: float`, is also included in the model
input to make the intended type explicit.
2. Updated the model handlers for Claude, Mistral, and OSS models to
parse model output more reliably. This patches the handlers released in
#299, which sometimes failed to parse valid model output. Only the
prompting models are affected; the FC models are unchanged.
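
As a rough illustration of the `float` handling in item 1, the conversion could look something like the sketch below. This is not the code from this PR: the function name `convert_float_params`, the example schema, and the assumption that parameters live under `parameters.properties` in a JSON-schema-like dict are all hypothetical, and nested object parameters are not handled.

```python
import copy


def convert_float_params(function_schema):
    """Return a copy of the schema with `float` parameters rewritten as `number`.

    Hypothetical helper: assumes a JSON-schema-like layout where each
    parameter spec sits under function_schema["parameters"]["properties"].
    """
    converted = copy.deepcopy(function_schema)  # leave the caller's dict untouched
    properties = converted.get("parameters", {}).get("properties", {})
    for spec in properties.values():
        if spec.get("type") == "float":
            # Use a type the FC model accepts, and keep the original
            # intent visible in both `format` and the description.
            spec["type"] = "number"
            spec["format"] = "float"
            description = spec.get("description", "").rstrip()
            spec["description"] = (
                description + " This is a float type value."
            ).strip()
    return converted


# Made-up function description, used only to exercise the helper.
example = {
    "name": "calculate_area",
    "parameters": {
        "type": "object",
        "properties": {
            "radius": {"type": "float", "description": "Radius of the circle."},
            "label": {"type": "string", "description": "Name of the shape."},
        },
    },
}

converted = convert_float_params(example)
print(converted["parameters"]["properties"]["radius"])
```

Working on a deep copy keeps the original dataset entry intact, so the same entry can still be passed unchanged to models that do accept `float`.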


This PR **DOES** change the leaderboard score. We will update the
leaderboard website shortly, in a different PR.

---------

Co-authored-by: Charlie Cheng-Jie Ji <charliechengjieji@berkeley.edu>
Co-authored-by: Fanjia Yan <fanjiayan@berkeley.edu>