Mateusz Bugaj chlorell

🪴

3 followers · 3 following

Ocean Planet Studios
Ancient Greece

Stars

lechmazur / position_bias

A benchmark for testing whether LLM judges keep the same preference when two lightly edited versions of the same story are shown in opposite orders.

14 stars · Updated Jun 11, 2026

lechmazur / buyout_game

A multi-agent benchmark in which eight LLMs play a money-driven elimination game with private transfers and a buyout endgame, and are ranked by final wealth.

15 stars · 1 fork · Updated May 27, 2026

lechmazur / persuasion

LLM Persuasion Benchmark tests whether one language model can change another model’s stated position over the course of a multi-turn conversation. It runs round-robin persuasion dialogues on contes…

30 stars · 1 fork · Updated Mar 27, 2026

lechmazur / debate

Adversarial multi-turn benchmark for LLM debate quality, using side-swapped matchups and multi-model judging to rank models by judged debate performance.

20 stars · 1 fork · Updated Jun 10, 2026

lechmazur / writing

This benchmark tests how well LLMs incorporate a set of 10 mandatory story elements (characters, objects, core concepts, attributes, motivations, etc.) into a short creative story.

388 stars · 10 forks · Updated Jun 10, 2026

lechmazur / writing_styles

Documents the style side of the short-story Creative Writing LLM benchmark: we generated many short stories with a range of LLMs, then analyzed those stories for stylistic fingerprints and within-m…

24 stars · 2 forks · Updated Dec 18, 2025

lechmazur / pact

A benchmark for conversational bargaining by language models. In each 20‑round match one LLM plays buyer, one plays seller, and both hold a hidden private value. Every round they swap a short publi…

44 stars · 1 fork · Updated Jun 10, 2026

lechmazur / bazaar

The BAZAAR challenges LLMs to navigate the double-auction marketplace, where buyers and sellers must make strategic decisions with incomplete information. Each agent receives a private value and mu…

37 stars · 4 forks · Updated Jul 30, 2025

lechmazur / emergent_collusion

Systemic, uninstructed collusion among frontier LLMs in a simulated bidding environment

18 stars · 1 fork · Updated Jul 15, 2025

lechmazur / pgg_bench

Public Goods Game (PGG) Benchmark: Contribute & Punish is a multi-agent benchmark that tests cooperative and self-interested strategies among Large Language Models (LLMs) in a resource-sharing econ…

41 stars · 2 forks · Updated Apr 10, 2025

lechmazur / step_game

Multi-Agent Step Race Benchmark: Assessing LLM Collaboration and Deception Under Pressure. A multi-player “step-race” that challenges LLMs to engage in public conversation before secretly picking a…

87 stars · 2 forks · Updated Dec 9, 2025

lechmazur / generalization

Thematic Generalization Benchmark: measures how effectively various LLMs can infer a narrow or specific "theme" (category/rule) from a small set of examples and anti-examples, then detect which ite…

71 stars · 2 forks · Updated Apr 16, 2026

lechmazur / divergent

LLM Divergent Thinking Creativity Benchmark. LLMs generate 25 unique words that start with a given letter and have no connection to each other or to 50 initial random words.

35 stars · 1 fork · Updated Mar 20, 2025

lechmazur / deception

Benchmark evaluating LLMs on their ability to create and resist disinformation. Includes comprehensive testing across major models (Claude, GPT-4, Gemini, Llama, etc.) with standardized evaluation …

33 stars · 2 forks · Updated Mar 20, 2025

lechmazur / nyt-connections

Benchmark that evaluates LLMs using 759 NYT Connections puzzles extended with extra trick words.

Python · 228 stars · 8 forks · Updated May 28, 2026

lechmazur / confabulations

Hallucinations (Confabulations) Document-Based Benchmark for RAG. Includes human-verified questions and answers.

HTML · 247 stars · 9 forks · Updated Aug 7, 2025

KhronosGroup / glTF

glTF – Runtime 3D Asset Delivery

HTML · 7,749 stars · 1,193 forks · Updated Jun 1, 2026

g-truc / gli

Forked from bagobor/gli

OpenGL Image (GLI)

C++ · 590 stars · 131 forks · Updated Apr 5, 2026
