29 Oct 25
24 Oct 25
15 Sep 25
This leaderboard ranks LLMs based on their performance in Valyrian Games competitions. Models are ranked using the TrueSkill rating system, which accounts for win/loss records and the relative skill of opponents.
by cos
5 months ago
https://docs.google.com/spreadsheets/d/1cchzP9dlTZ3WXQTfYNhh3avxoLipqHN75v1Tb86uhHo/edit?gid=0#gid=0
Benchmark of smartphone interaction for LLMs and multi-agent systems
by cos
5 months ago
04 Sep 25
Measures how effectively various LLMs can infer a narrow or specific “theme” (category/rule) from a small set of examples and anti-examples, then detect which item truly fits that theme among a collection of misleading candidates.
by cos
5 months ago
A multi-player “step-race” that challenges LLMs to engage in public conversation before secretly picking a move (1, 3, or 5 steps). Whenever two or more players choose the same number, all colliding players fail to advance.
by cos
5 months ago