Stars
Benchmarking Language Agents Under Controllable and Extreme Context Growth
[ICLR'26 Oral] LLM DNA: Tracing Model Evolution via Functional Representations
CL-bench: A Benchmark for Context Learning
Get 10X more out of Claude Code, Codex, or any coding agent
Complete Claude Code configuration collection - agents, skills, hooks, commands, rules, MCPs. Battle-tested configs from an Anthropic hackathon winner.
A cross-platform desktop All-in-One assistant tool for Claude Code, Codex, OpenCode & Gemini CLI.
AgencyBench: Benchmarking the Frontiers of Autonomous Agents in 1M-Token Real-World Contexts
Your own personal AI assistant. Any OS. Any platform. The lobster way. 🦞
Originally written by the well-known hacker Eric S. Raymond, this document teaches you how to ask technical questions correctly and get answers you'll be satisfied with.
Python SDK, Proxy Server (AI Gateway) to call 100+ LLM APIs in OpenAI (or native) format, with cost tracking, guardrails, loadbalancing and logging. [Bedrock, Azure, OpenAI, VertexAI, Cohere, Anthr…
SkillsBench evaluates how well skills work and how effective agents are at using them
Tile-Based Runtime for Ultra-Low-Latency LLM Inference
Trae Agent is an LLM-based agent for general purpose software engineering tasks.
The Tool Decathlon: Benchmarking Language Agents for Diverse, Realistic, and Long-Horizon Task Execution
Official Implementation of EAGLE-1 (ICML'24), EAGLE-2 (EMNLP'24), and EAGLE-3 (NeurIPS'25).
SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks?
A curated list of awesome skills, hooks, slash-commands, agent orchestrators, applications, and plugins for Claude Code by Anthropic
SWE-agent takes a GitHub issue and tries to automatically fix it, using your LM of choice. It can also be employed for offensive cybersecurity or competitive coding challenges. [NeurIPS 2024]
LiveBench: A Challenging, Contamination-Free LLM Benchmark
Repo for "AlphaResearch: Accelerating New Algorithm Discovery with Language Models"
An Open Phone Agent Model & Framework. Unlocking the AI Phone for Everyone
User-friendly AI Interface (Supports Ollama, OpenAI API, ...)