24 Oct 25
15 Sep 25
This leaderboard ranks LLMs based on their performance in Valyrian Games competitions. Models are ranked using the TrueSkill rating system, which accounts for win/loss records and the relative skill of opponents.
by cos
4 months ago
https://docs.google.com/spreadsheets/d/1cchzP9dlTZ3WXQTfYNhh3avxoLipqHN75v1Tb86uhHo/edit?gid=0#gid=0
Benchmark of smartphone interaction for LLMs and multi-agent systems
by cos
4 months ago
11 Sep 25
04 Sep 25
Measures how effectively various LLMs can infer a narrow or specific “theme” (category/rule) from a small set of examples and anti-examples, then detect which item truly fits that theme among a collection of misleading candidates.
by cos
5 months ago
A multi-player “step-race” that challenges LLMs to engage in public conversation before secretly picking a move (1, 3, or 5 steps). Whenever two or more players choose the same number, all colliding players fail to advance.
by cos
5 months ago