linkhut

This leaderboard ranks LLMs based on their performance in Valyrian Games competitions. Models are ranked using the TrueSkill rating system, which accounts for win/loss records and the relative skill of opponents.

by cos 5 months ago

AndroidWorld Leaderboard [last update: 28/08/2025] - Google Sheets

https://docs.google.com/spreadsheets/d/1cchzP9dlTZ3WXQTfYNhh3avxoLipqHN75v1Tb86uhHo/edit?gid=0#gid=0

Benchmark of smartphone interaction for LLMs and multi-agent systems

by cos 5 months ago

04 Sep 25

lechmazur/generalization: Thematic Generalization Benchmark

https://github.com/lechmazur/generalization

Measures how effectively various LLMs can infer a narrow or specific “theme” (category/rule) from a small set of examples and anti-examples, then detect which item truly fits that theme among a collection of misleading candidates.

by cos 5 months ago

lechmazur/step_game: Multi-Agent Step Race Benchmark: Assessing LLM Collaboration and Deception Under Pressure

https://github.com/lechmazur/step_game

A multi-player “step-race” that challenges LLMs to engage in public conversation before secretly picking a move (1, 3, or 5 steps). Whenever two or more players choose the same number, all colliding players fail to advance.

by cos 5 months ago
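
The collision rule in the step_game description above can be sketched in a few lines of Python. This is only an illustration of the rule as summarized in the bookmark, not code from the lechmazur/step_game repository: the names `resolve_round` and `ALLOWED_MOVES` are hypothetical, every player is assumed to submit exactly one move per round, and the win condition is left out because the description does not state it.

```python
from collections import Counter

# Moves allowed by the description: 1, 3, or 5 steps.
ALLOWED_MOVES = (1, 3, 5)


def resolve_round(positions, moves):
    """Return new positions after one round.

    A player advances by their chosen number of steps only if no other
    player picked the same number; all colliding players stay where they are.
    """
    for player, move in moves.items():
        if move not in ALLOWED_MOVES:
            raise ValueError(f"{player} chose an illegal move: {move}")

    # Count how many players picked each number of steps.
    counts = Counter(moves.values())

    return {
        player: position + (moves[player] if counts[moves[player]] == 1 else 0)
        for player, position in positions.items()
    }


if __name__ == "__main__":
    positions = {"A": 0, "B": 0, "C": 0}
    moves = {"A": 5, "B": 5, "C": 3}  # A and B collide on 5
    print(resolve_round(positions, moves))
```

In this example A and B both pick 5 and stay at 0, while C's unique pick of 3 moves it forward, which is why the public conversation before the secret pick matters.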