Stars
Agentic RL on Any Harness at Scale
Benchmarking Open-Ended Inference Optimization by AI Agents
Oh my tmux! My self-contained, pretty & versatile tmux configuration made with 💛🩷💙🖤❤️🤍
Can Language Models Rebuild Programs From Scratch?
A benchmark for evaluating LLMs on open-ended CS problems. Exploring the Next Frontier of Computer Science.
AIDE: AI-Driven Exploration in the Space of Code. The machine learning engineering agent that automates AI R&D.
🪨 why use many token when few token do trick — Claude Code skill that cuts 65% of tokens by talking like caveman
OpenSeeker: A search agent with open-source data and models
Production-grade engineering skills for AI coding agents.
Real-time AI assistant for Meta Ray-Ban smart glasses -- voice + vision + agentic actions via Gemini Live and OpenClaw
Train the smallest LM you can that fits in 16MB. Best model wins!
Meta-Harness: 76.4% on Terminal-Bench 2.0 (Claude Opus 4.6)
SkillsBench evaluates how well skills work and how effective agents are at using them.
Agent framework and applications built upon Qwen>=3.0, featuring Function Calling, MCP, Code Interpreter, RAG, Chrome extension, etc.
My learning notes for ML SYS.
OpenResearcher: A Fully Open Pipeline for Long-Horizon Deep Research Trajectory Synthesis
CL-bench: A Benchmark for Context Learning
OpenTinker is an RL-as-a-Service infrastructure for foundation models
[AAAI26]: DS SERVE: The Largest Open Vector Store over Pretrain Data; A Framework for Efficient and Scalable Neural Retrieval
800,000 step-level correctness labels on LLM solutions to MATH problems
Code repo for "LifelongAgentBench: Evaluating LLM Agents as Lifelong Learners"
A curated collection of 1000+ agent skills from official dev teams and the community, compatible with Claude Code, Codex, Gemini CLI, Cursor, and more.
Comprehensive open-source library of AI research and engineering skills for any AI model. Package the skills and your Claude Code/Codex/Gemini agent will be an AI research agent with full horsepowe…
Official JAX implementation of End-to-End Test-Time Training for Long Context