From prompt to paste: evaluate AI / LLM output under a strict Python sandbox and get actionable scores across 7 categories, including security, correctness and upkeep.
Updated Sep 20, 2025 - Python
Open-source framework for defining Page Language Models (PLMs) for intelligent app understanding and AI-assisted testing.
xVerify: Efficient Answer Verifier for Large Language Model Evaluations
Benchmark that evaluates LLMs using 759 NYT Connections puzzles extended with extra trick words
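As a rough illustration of what a Connections-style benchmark checks (a hypothetical sketch, not this repository's actual code): each puzzle hides four groups of four words, and a model's guess counts only if it exactly matches one hidden group. The words and helper below are invented for the example.

```python
def score_guess(groups, guess):
    """Return True if `guess` (four words) exactly matches one hidden group."""
    guess_set = frozenset(guess)
    if len(guess_set) != 4:  # duplicates or wrong count never match
        return False
    return any(guess_set == frozenset(g) for g in groups)

# Toy puzzle: four hidden groups of four words each.
puzzle = [
    ["apple", "banana", "cherry", "grape"],  # fruits
    ["red", "green", "blue", "yellow"],      # colors
    ["run", "jump", "swim", "climb"],        # verbs
    ["oak", "pine", "elm", "maple"],         # trees
]

print(score_guess(puzzle, ["red", "blue", "green", "yellow"]))  # True
print(score_guess(puzzle, ["red", "blue", "oak", "maple"]))     # False
```

The "extra trick words" mentioned in the description would act as distractors that plausibly fit more than one group, making exact-match scoring harder for the model.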
BEHAVIOR-1K: a platform for accelerating Embodied AI research. Join our Discord for support: https://discord.gg/bccR5vGFEx
A validation and profiling tool for AI infrastructure
MTEB: Massive Text Embedding Benchmark
This repo investigates LLMs' tendency toward acquiescence bias in sequential QA interactions. It includes evaluation methods, datasets, benchmarks, and experiment code to assess and mitigate weaknesses in conversational consistency and robustness, offering a reproducible framework for future research.
🏋️ A unified multi-backend utility for benchmarking Transformers, Timm, PEFT, Diffusers and Sentence-Transformers with full support of Optimum's hardware optimizations & quantization schemes.
Automatically download VPR datasets in a standard format
RAID is the largest and most challenging benchmark for AI-generated text detection. (ACL 2024)
Benchmarks of Spring Boot REST service comparing Java 21 Virtual Threads (Project Loom) with WebFlux (Project Reactor).
OpenCUA: Open Foundations for Computer-Use Agents
[NeurIPS 2025] Official implementation for the paper "SeePhys: Does Seeing Help Thinking? -- Benchmarking Vision-Based Physics Reasoning"