Stars
A benchmark for testing whether LLM judges keep the same preference when two lightly edited versions of the same story are shown in opposite orders.
A multi-agent benchmark where eight LLMs play a money-driven elimination game with private transfers and a buyout endgame, and are ranked by final wealth
LLM Persuasion Benchmark tests whether one language model can change another model’s stated position over the course of a multi-turn conversation. It runs round-robin persuasion dialogues on contes…
Adversarial multi-turn benchmark for LLM debate quality, using side-swapped matchups and multi-model judging to rank models by judged debate performance.
This benchmark tests how well LLMs incorporate a set of 10 mandatory story elements (characters, objects, core concepts, attributes, motivations, etc.) in a short creative story
Documents the style side of the short-story Creative Writing LLM benchmark: we generated many short stories with a range of LLMs, then analyzed those stories for stylistic fingerprints and within-m…
A benchmark for conversational bargaining by language models. In each 20‑round match one LLM plays buyer, one plays seller, and both hold a hidden private value. Every round they swap a short publi…
The BAZAAR challenges LLMs to navigate the double-auction marketplace, where buyers and sellers must make strategic decisions with incomplete information. Each agent receives a private value and mu…
Systemic, uninstructed collusion among frontier LLMs in a simulated bidding environment
Public Goods Game (PGG) Benchmark: Contribute & Punish is a multi-agent benchmark that tests cooperative and self-interested strategies among Large Language Models (LLMs) in a resource-sharing econ…
Multi-Agent Step Race Benchmark: Assessing LLM Collaboration and Deception Under Pressure. A multi-player “step-race” that challenges LLMs to engage in public conversation before secretly picking a…
Thematic Generalization Benchmark: measures how effectively various LLMs can infer a narrow or specific "theme" (category/rule) from a small set of examples and anti-examples, then detect which ite…
LLM Divergent Thinking Creativity Benchmark. LLMs generate 25 unique words that start with a given letter with no connections to each other or to 50 initial random words.
Benchmark evaluating LLMs on their ability to create and resist disinformation. Includes comprehensive testing across major models (Claude, GPT-4, Gemini, Llama, etc.) with standardized evaluation …
Benchmark that evaluates LLMs using 759 NYT Connections puzzles extended with extra trick words
Hallucinations (Confabulations) Document-Based Benchmark for RAG. Includes human-verified questions and answers.
ASL libraries will be migrated here in the stlab namespace, new libraries will be created here.
Angle Project (https://code.google.com/p/angleproject/) with support for Windows Store Apps (WinRT)
The JavaScript library for modern SVG graphics.