BrowseComp/HLE Reproducibility #87

@panademo

Hi Zhipu team, thank you so much for open-sourcing such impressive models and sharing your research!
I have two quick questions:

  1. How can the BrowseComp and HLE evaluation results be reproduced? I noticed the two files "trajectory_search.json" and "glm_4.6_tir_guide.md" in the repository, but are they enough to reproduce the reported results?
  2. Is the search-agent framework you used for the BrowseComp/HLE evaluation open-source, or do you plan to open-source it?

Also, if I may ask, which agent framework was used to achieve the SWE-Bench Verified results?

Thanks again for your great work! 🙏


