04 Mar 26

An LLM benchmark that tests whether a model pushes back against BS or goes along with it.


15 Sep 25

This leaderboard ranks LLMs based on their performance in Valyrian Games competitions. Models are ranked using the TrueSkill rating system, which accounts for win/loss records and the relative skill of opponents.



04 Sep 25

Measures how effectively various LLMs can infer a narrow or specific “theme” (category/rule) from a small set of examples and anti-examples, then detect which item truly fits that theme among a collection of misleading candidates.

A multi-player “step-race” that challenges LLMs to engage in public conversation before secretly picking a move (1, 3, or 5 steps). Whenever two or more players choose the same number, all colliding players fail to advance.
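The collision rule can be sketched in a few lines of Python. This is only an illustration of the rule as described above, not the benchmark's actual code; the names `ALLOWED_MOVES` and `resolve_round` and the dictionary-based state are assumptions made for the example.

```python
from collections import Counter

# Moves allowed by the game description: 1, 3, or 5 steps.
ALLOWED_MOVES = (1, 3, 5)


def resolve_round(positions, moves):
    """Return new positions after one round (hypothetical helper).

    positions: dict mapping player name -> current position
    moves:     dict mapping player name -> secretly chosen move
    A player advances only if nobody else picked the same number.
    """
    for player, move in moves.items():
        if move not in ALLOWED_MOVES:
            raise ValueError(f"{player} chose an illegal move: {move}")

    # Count how many players picked each number.
    picks = Counter(moves.values())

    updated = dict(positions)
    for player, move in moves.items():
        if picks[move] == 1:  # unique pick, so the player advances
            updated[player] = positions.get(player, 0) + move
        # otherwise the player collided and stays in place
    return updated


if __name__ == "__main__":
    start = {"a": 0, "b": 0, "c": 0}
    print(resolve_round(start, {"a": 5, "b": 5, "c": 3}))
```

In this example players "a" and "b" both pick 5 and stay put, while "c" picks 3 alone and advances.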

17 Nov 20

Compares the developer experience of performing clean installs of Oracle and PostgreSQL, in terms of time and resource usage.