Renmin University of China
Beijing
Stars
A benchmark for LLMs on complicated tasks in the terminal
daVinci-Agency: Unlocking Long-Horizon Agency Data-Efficiently
LLM-in-Sandbox Elicits General Agentic Intelligence
GRPO training code which scales to 32xH100s for long-horizon terminal/coding tasks. Base agent is now the top Qwen3 agent on Stanford's TerminalBench leaderboard.
NexRL is an ultra-loosely-coupled LLM post-training framework.
Lightweight coding agent that runs in your terminal
Bash Is All You Need: write a nano Claude Code from 0 to 1
[COLM 2025] Official repository for R2E-Gym: Procedural Environment Generation and Hybrid Verifiers for Scaling Open-Weights SWE Agents
A survey of Code Agents / Foundation Models for improving development productivity. Become 10x SWE, MLE, etc.
Tongyi Deep Research, the Leading Open-source Deep Research Agent
Pioneering Automated GUI Interaction with Native Agents
A novel two-stage coarse-to-fine information-seeking method to enhance the multi-document question-answering capabilities of LLMs.
Collection of extracted System Prompts from popular chatbots like ChatGPT, Claude & Gemini
gpt-oss-120b and gpt-oss-20b are two open-weight language models by OpenAI
Examples and guides for using the OpenAI API
Make websites accessible for AI agents. Automate tasks online with ease.
Evergreen, contamination-free, real-world, domain-specific AI evaluation framework
DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents
Letta is the platform for building stateful agents: AI with advanced memory that can learn and self-improve over time.
smolagents: a barebones library for agents that think in code.
R1-Searcher++: Incentivizing the Dynamic Knowledge Acquisition of LLMs via Reinforcement Learning