04 Mar 26
An LLM benchmark testing the model’s pushback (or lack thereof) against BS.
29 Oct 25
24 Oct 25
15 Sep 25
This leaderboard ranks LLMs based on their performance in Valyrian Games competitions. Models are ranked using the TrueSkill rating system, which accounts for win/loss records and the relative skill of opponents.
A benchmark of smartphone interaction for LLMs and multi-agent systems.
04 Sep 25
Measures how effectively various LLMs can infer a narrow or specific “theme” (category/rule) from a small set of examples and anti-examples, then detect which item truly fits that theme among a collection of misleading candidates.
A multi-player “step-race” that challenges LLMs to engage in public conversation before secretly picking a move (1, 3, or 5 steps). Whenever two or more players choose the same number, all colliding players fail to advance.
05 May 25
03 Mar 21
17 Nov 20
Comparing the developer experience, in terms of time and resource usage, of performing clean installs of Oracle and PostgreSQL.