Starred repositories
UniRL is a framework for Unified Multimodal Model Reinforcement Learning
PiD: Fast and High-Resolution Latent Decoding with Pixel Diffusion
CODA: Rewriting Transformer Blocks as GEMM-Epilogue Programs
Official Repo of "D-OPSD: On-Policy Self-Distillation for Continuously Tuning Step-Distilled Diffusion Models"
A single CLAUDE.md file to improve Claude Code behavior, derived from Andrej Karpathy's observations on LLM coding pitfalls.
NVIDIA AITune is an inference toolkit designed for tuning and deploying Deep Learning models with a focus on NVIDIA GPUs.
A plug-and-play compiler that delivers free-lunch optimizations for both inference and training.
Bash is all you need - A nano Claude Code–like 「agent harness」, built from 0 to 1
The agent harness performance optimization system. Skills, instincts, memory, security, and research-first development for Claude Code, Codex, Opencode, Cursor and beyond.
rCM & Causal-rCM: Leading and Unified Algorithms/Infrastructures for Bidirectional/Autoregressive Video Diffusion Distillation at Scale
TurboDiffusion: 100–200× Acceleration for Video Diffusion Models
A framework for efficient model inference with omni-modality models
flex-block-attn: an efficient block sparse attention computation library
Transforming Video Diffusion with Temporal Sparse Attention
A Python-embedded DSL that makes it easy to write fast, scalable ML kernels with minimal boilerplate.
torchcomms: a modern PyTorch communications API
SLA: Beyond Sparsity in Diffusion Transformers via Fine-Tunable Sparse–Linear Attention
A Distributed Attention Towards Linear Scalability for Ultra-Long Context, Heterogeneous Data Training
Aims to integrate most existing feature-caching-based diffusion acceleration schemes into a unified framework.
Lightweight Image Video Action Generation Inference Framework
Trainable fast and memory-efficient sparse attention
FlagGems is an operator library for large language models implemented in the Triton Language.
A PyTorch-native inference engine with cache, parallelism, quantization and CPU offload for DiTs.
(CVPR 2025) From Slow Bidirectional to Fast Autoregressive Video Diffusion Models