Stars
Arena-Hard-Auto: An automatic LLM benchmark.
FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens.
🐳 Efficient Triton implementations for "Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention"
Ring attention implementation with flash attention
A library with extensible implementations of DPO, KTO, PPO, ORPO, and other human-aware loss functions (HALOs); a minimal DPO-loss sketch appears after this list.
depyf is a tool to help you understand and adapt to the PyTorch compiler, torch.compile.
ReST-MCTS*: LLM Self-Training via Process Reward Guided Tree Search (NeurIPS 2024)
Automatically split your PyTorch models across multiple GPUs for training & inference
Processed / Cleaned Data for Paper Copilot
Chinese translation of "Reinforcement Learning: An Introduction" (Second Edition)
Unofficial PyTorch implementation of Denoising Diffusion Probabilistic Models
Explore the Multimodal "Aha Moment" on a 2B Model
USP: Unified (a.k.a. Hybrid, 2D) Sequence Parallel Attention for Long Context Transformers Model Training and Inference
A recipe for online RLHF and online iterative DPO.
[ICLR 2025] DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads
Trainable fast and memory-efficient sparse attention
[ICML 2024] CLLMs: Consistency Large Language Models
Implementation of Denoising Diffusion Probabilistic Models in PyTorch (a sketch of the forward noising process appears after this list)
[ICLR 2025] Official PyTorch Implementation of Gated Delta Networks: Improving Mamba2 with Delta Rule
AlignProp uses direct reward backpropagation for the alignment of large-scale text-to-image diffusion models. Our method is 25x more sample- and compute-efficient than reinforcement learning methods…
Universal Adversarial Triggers for Attacking and Analyzing NLP (EMNLP 2019)
RL from zero pretraining: can it be done? Yes.
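Several of the entries above (the HALOs library and the online iterative DPO recipe) center on Direct Preference Optimization. Below is a minimal sketch of the standard DPO loss from Rafailov et al. (2023) in plain PyTorch, assuming per-sequence log-probabilities have already been computed; the function and argument names are illustrative and are not the API of any library listed here.

```python
# Minimal DPO loss sketch (not the HALOs API); argument names are illustrative.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Each argument is a 1-D tensor of summed token log-probs per sequence."""
    # Log-ratios of policy vs. reference model for preferred / dispreferred responses.
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps
    # Standard DPO objective: -log sigmoid(beta * margin of log-ratios).
    logits = beta * (chosen_logratios - rejected_logratios)
    return -F.logsigmoid(logits).mean()

# Example with random log-probs for a batch of 4 preference pairs.
if __name__ == "__main__":
    b = 4
    loss = dpo_loss(torch.randn(b), torch.randn(b), torch.randn(b), torch.randn(b))
    print(loss.item())
```

The beta hyperparameter scales the implicit reward margin, i.e. how strongly the policy is penalized for drifting from the reference model on the preferred vs. dispreferred response.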
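The two DDPM implementations above both rely on the same closed-form forward (noising) process, q(x_t | x_0) = N(sqrt(ᾱ_t) x_0, (1 − ᾱ_t) I). The sketch below shows that step with the linear beta schedule from Ho et al. (2020); it is an illustration under those assumptions, not code taken from either repository.

```python
# Minimal DDPM forward-process sketch; tensor names are illustrative.
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)       # linear noise schedule
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)   # \bar{alpha}_t = prod_{s<=t} alpha_s

def q_sample(x0, t, noise=None):
    """Sample x_t ~ q(x_t | x_0) for a batch of images x0 and integer timesteps t."""
    if noise is None:
        noise = torch.randn_like(x0)
    ab = alpha_bars[t].view(-1, 1, 1, 1)    # broadcast over (B, C, H, W)
    return ab.sqrt() * x0 + (1.0 - ab).sqrt() * noise

# Example: noise a batch of 8 fake 32x32 RGB images at random timesteps.
x0 = torch.randn(8, 3, 32, 32)
t = torch.randint(0, T, (8,))
xt = q_sample(x0, t)
print(xt.shape)  # torch.Size([8, 3, 32, 32])
```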