Renmin University of China, Beijing
Stars
A benchmark for LLMs on complicated tasks in the terminal
daVinci-Agency: Unlocking Long-Horizon Agency Data-Efficiently
LLM-in-Sandbox Elicits General Agentic Intelligence
GRPO training code that scales to 32xH100s for long-horizon terminal/coding tasks. The base agent is now the top Qwen3 agent on Stanford's TerminalBench leaderboard.
NexRL is an ultra-loosely-coupled LLM post-training framework.
Lightweight coding agent that runs in your terminal
Bash is all you need: write a nano Claude Code from 0 to 1
[COLM 2025] Official repository for R2E-Gym: Procedural Environment Generation and Hybrid Verifiers for Scaling Open-Weights SWE Agents
A survey of Code Agents / Foundation Models for improving development productivity. Become a 10x SWE, MLE, etc.
Tongyi Deep Research, the Leading Open-source Deep Research Agent
Pioneering Automated GUI Interaction with Native Agents
A novel two-stage coarse-to-fine information-seeking method to enhance the multi-document question-answering capabilities of LLMs.
Collection of extracted System Prompts from popular chatbots like ChatGPT, Claude & Gemini
gpt-oss-120b and gpt-oss-20b are two open-weight language models by OpenAI
Examples and guides for using the OpenAI API
🌐 Make websites accessible for AI agents. Automate tasks online with ease.
Evergreen, contamination-free, real-world, domain-specific AI evaluation framework
DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents
Letta is the platform for building stateful agents: AI with advanced memory that can learn and self-improve over time.
🤗 smolagents: a barebones library for agents that think in code.
R1-Searcher++: Incentivizing the Dynamic Knowledge Acquisition of LLMs via Reinforcement Learning
🔍 Awesome Agentic Search is a curated list of papers, tools, and resources on agentic search—where AI agents plan, search, and reason to answer complex questions. Explore the latest research, bench…
DeepResearchAgent is a hierarchical multi-agent system designed not only for deep research tasks but also for general-purpose task solving. The framework leverages a top-level planning agent to coo…