Stars
RoboPhD: Evolving Diverse Complex Agents Under Tight Evaluation Budgets
CORAL is a robust, lightweight infrastructure for multi-agent autonomous self-evolution, built for autoresearch. Works with Claude Code, Codex, Cursor, OpenCode, Kiro, and more.
A Difficulty-Calibrated Benchmark for Building Terminal Agents
Super basic implementation (gist-like) of RLMs with REPL environments.
Harbor is a framework for running agent evaluations and creating and using RL environments.
[COLM 2025] Official repository for R2E-Gym: Procedural Environment Generation and Hybrid Verifiers for Scaling Open-Weights SWE Agents
[ICLR 2026] End-to-End Reinforcement Learning for Multi-Turn Tool-Integrated Reasoning
Optimize prompts, code, and more with AI-powered Reflective Text Evolution
Open-source implementation of AlphaEvolve
Recovery-Bench is a benchmark for evaluating the capability of LLM agents to recover from mistakes
Claude Code is an agentic coding tool that lives in your terminal, understands your codebase, and helps you code faster by executing routine tasks, explaining complex code, and handling git workflo…
Repo for Rho-1: Token-level Data Selection & Selective Pretraining of LLMs.
NPUEval is an LLM evaluation dataset written specifically to target AIE kernel code generation on RyzenAI hardware.
MCP server integrating GEPA (Genetic-Evolutionary Prompt Architecture) for automatic prompt optimization with Claude Desktop
Renderer for the harmony response format to be used with gpt-oss
slime is an LLM post-training framework for RL Scaling.
Trajectories for running OpenHands on Terminal Bench
A course on LLM inference serving on Apple Silicon for systems engineers: build a tiny vLLM + Qwen.
[NeurIPS '25] GSO: Challenging Software Optimization Tasks for Evaluating SWE-Agents
A benchmark for LLMs on complicated tasks in the terminal
Sky-T1: Train your own O1 preview model within $450