AI Benchmark Tools
Evals is a framework for evaluating LLMs and LLM systems, and an open-source registry of benchmarks.
Awesome-LLM-Eval: a curated list of tools, datasets/benchmarks, demos, leaderboards, papers, docs, and models, mainly for evaluating foundation LLMs, with the aim of exploring the technical boundaries of generative AI.
A curated list of popular datasets, models, and papers for LLMs in medicine and healthcare.
A framework for few-shot evaluation of language models.
Open-source evaluation toolkit for large multi-modality models (LMMs), supporting 220+ LMMs and 80+ benchmarks.
Gorilla: Training and Evaluating LLMs for Function Calls (Tool Calls)
Verify the precision of all Kimi K2 API vendors.