inspect_evals to AVID report#9
Conversation
| report = Report() | ||
|
|
||
| report.affects = Affects( | ||
| developer=[], |
There was a problem hiding this comment.
for this report, developer should be OpenAI, so programmatically we need to parse the first part of openai/gpt-4o-mini, the do a key-value search from a dict of the form name: human-readable-name to append the human readable name here
There was a problem hiding this comment.
It is a bit difficult to have the extraction generalizable because for example one can use model created by MetaAI hosted via AzureAI (Microsoft).
There was a problem hiding this comment.
For example: inspect eval --model azureai/llama-2-70b-chat-wnsnw
There was a problem hiding this comment.
Those are special cases where deployer can be populated in place of developer. we can even choose to populate only deployer. In any case, splitting the name and where it is hosted/developed is the right thing to do to fit the schema, as compared to not splitting
| artifacts=[ | ||
| Artifact( | ||
| type=ArtifactTypeEnum.model, | ||
| name=eval_log.eval.model |
There was a problem hiding this comment.
this should be only gpt-4o-mini
| type=TypeEnum.measurement, | ||
| description=LangValue( | ||
| lang='eng', | ||
| value=eval_log.eval.task |
There was a problem hiding this comment.
this should be a canned sentence of the form f"Evaluation of the LLM {model_name} on the {benchmark} benchmark using Inspect Evals"
| lang='eng', | ||
| value=f"Sample input: {sample.input}\n" | ||
| f"Model output: {sample.output}\n" | ||
| f"Score: {sample.score}" |
There was a problem hiding this comment.
this is a good structure. to set the context, can you
- start with a canned description of the benchmark (if it's there in the logs), or just the canned sentence in problemtype.
- add a field
f"Scorer: {scorer_description}\n"before the score, so the reader knows what the score signifies
| Reference( | ||
| type='source', | ||
| label='Inspect Evaluation Log', | ||
| url=file_path |
There was a problem hiding this comment.
when this report goes up on AVID the user is not able to access this log. Instead, plz point to the benchmark itself. e.g. if we're ingesting a report on BOLD, the url point to that module in inspect_evals that we're contributing, and the corresponding page in Inspect Evals docs
shubhobm
left a comment
There was a problem hiding this comment.
Code is functional! a few comments on organization
|
@shubhobm please take a look now |
No description provided.