Stars
A Python module to repair invalid JSON from LLMs
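For context on what this entry covers, a minimal usage sketch, assuming the `json-repair` package on PyPI and its `repair_json`/`loads` helpers (an illustration, not the repo's own documentation):

```python
# Minimal sketch: assumes the json-repair package from PyPI, whose
# repair_json() and loads() helpers fix common LLM output mistakes
# such as trailing commas and unclosed objects.
import json_repair
from json_repair import repair_json

broken = '{"answer": "42", "confidence": 0.9,'   # trailing comma, missing closing brace

fixed_str = repair_json(broken)        # returns a repaired JSON string
fixed_obj = json_repair.loads(broken)  # drop-in replacement for json.loads

print(fixed_str)
print(fixed_obj["confidence"])
```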
The Tool Decathlon: Benchmarking Language Agents for Diverse, Realistic, and Long-Horizon Task Execution
KernelBench: Can LLMs Write GPU Kernels? - Benchmark with Torch -> CUDA (+ more DSLs)
This repository contains the toolkit for replicating results from our technical report.
Official repo of Toucan: Synthesizing 1.5M Tool-Agentic Data from Real-World MCP Environments
A scalable asynchronous reinforcement learning implementation with in-flight weight updates.
An Efficient and User-Friendly Scaling Library for Reinforcement Learning with Large Language Models
Best practices for training DeepSeek, Mixtral, Qwen and other MoE models using Megatron Core.
The official repo of Pai-Megatron-Patch for large-scale LLM & VLM training, developed by Alibaba Cloud.
An extremely fast Python linter and code formatter, written in Rust.
Use PEFT or full-parameter training for CPT/SFT/DPO/GRPO on 500+ LLMs (Qwen3, Qwen3-MoE, Llama4, GLM4.5, InternLM3, DeepSeek-R1, ...) and 200+ MLLMs (Qwen3-VL, Qwen3-Omni, InternVL3.5, Ovis2.5, Llava, GLM4v, Ph…
Weave is a toolkit for developing AI-powered applications, built by Weights & Biases.
Post-training with Tinker
Accompanying material for the sleep-time compute paper
Supporting code for the blog post on modular manifolds.
Research code artifacts for Code World Model (CWM), including inference tools, reproducibility, and documentation.
A sophisticated multi-step reasoning pipeline powered by the Datarus-R1-14B-Preview model
Qwen3-VL is the multimodal large language model series developed by the Qwen team at Alibaba Cloud.
Meta Agents Research Environments is a comprehensive platform designed to evaluate AI agents in dynamic, realistic scenarios. Unlike static benchmarks, this platform introduces evolving environment…
[ICLR 2025] Automated Design of Agentic Systems
[NeurIPS 2025 Spotlight] Reasoning Environments for Reinforcement Learning with Verifiable Rewards
Training LLMs to reason and analyze data with notebooks