Stars
Domain-specific language designed to streamline the development of high-performance GPU/CPU/accelerator kernels
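A minimal sketch of what a kernel written in this kind of DSL looks like, using the canonical vector-add example from the Triton tutorials; the block size and grid here are illustrative choices, not tuned values.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance processes one BLOCK_SIZE-wide slice of the inputs.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements          # guard the tail of the vector
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = out.numel()
    grid = (triton.cdiv(n, 1024),)       # one program per 1024-element block
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out
```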
Triton Support in Compiler Explorer
FlashInfer: Kernel Library for LLM Serving
PKU-DAIR / Hetu-Galvatron
Forked from AFDWang/Hetu-Galvatron. Galvatron is an automatic distributed training system designed for Transformer models, including Large Language Models (LLMs).
depyf is a tool to help you understand and adapt to the PyTorch compiler, torch.compile.
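A hedged usage sketch: torch.compile is PyTorch's public compiler entry point, while the depyf context manager shown (prepare_debug) follows the project's README and is an assumption here, since the exact name may vary between versions.

```python
import torch
import depyf

def f(x):
    # a toy function for torch.compile to trace and optimize
    return torch.sin(x) + torch.cos(x)

compiled_f = torch.compile(f)

# Assumed API from depyf's README: dump the decompiled/generated sources
# produced during compilation into a local directory for inspection.
with depyf.prepare_debug("./depyf_dump"):
    compiled_f(torch.randn(8))
```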
Fine-tuning & Reinforcement Learning for LLMs. 🦥 Train OpenAI gpt-oss, DeepSeek-R1, Qwen3, Gemma 3, TTS 2x faster with 70% less VRAM.
kaldi-asr/kaldi is the official location of the Kaldi project.
Unified multidimensional array model that collects nonrectangular shapes, advanced indexing, views, and sparsity into a single set of composable abstractions.
FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens.
ByteDance's PyTorch Distributed framework for hyperscale training of LLMs and RL
A PyTorch native platform for training generative AI models
CUDA Templates and Python DSLs for High-Performance Linear Algebra
Training LLMs with QLoRA + FSDP
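For context, a generic QLoRA setup with transformers/peft/bitsandbytes is sketched below; this is not the repo's own training script, the model name and LoRA hyperparameters are illustrative assumptions, and the repo's actual contribution (sharding the quantized model across GPUs with FSDP) is omitted here.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit NF4 quantization of the base weights (the "Q" in QLoRA).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Illustrative model choice; any causal LM checkpoint would do.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,
)

# Trainable low-rank adapters on top of the frozen 4-bit base (the LoRA part).
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```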
GPU programming-related news and material links
Official inference library for Mistral models
Distributed Machine Learning Patterns from Manning Publications by Yuan Tang https://bit.ly/2RKv8Zo
Interactive deep learning book with multi-framework code, math, and discussions. Adopted at 500 universities from 70 countries including Stanford, MIT, Harvard, and Cambridge.
A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)