Stars
PTX ISA 9.1 documentation converted to searchable markdown. Includes Claude Code skill for CUDA development.
Unofficial description of the CUDA assembly (SASS) instruction sets.
A plug-and-play compiler that delivers free-lunch optimizations for both inference and training.
My Python scripts to make high-quality figures for publications in top AI conferences and journals.
A curated list of projects related to the reMarkable tablet
CUDA Tile IR is an MLIR-based intermediate representation and compiler infrastructure for CUDA kernel optimization, focusing on tile-based computation patterns and optimizations targeting NVIDIA te…
A lightweight design for computation-communication overlap.
A Datacenter Scale Distributed Inference Serving Framework
A curated collection of resources, tutorials, and best practices for learning and mastering NVIDIA CUTLASS
FlashInfer: Kernel Library for LLM Serving
FlashMLA: Efficient Multi-head Latent Attention Kernels
GitHub mirror of the triton-lang/triton repo.
FlexFlow Serve: Low-Latency, High-Performance LLM Serving
Multi-Faceted AI Agent and Workflow Autotuning. Automatically optimizes LangChain, LangGraph, DSPy programs for better quality, lower execution latency, and lower execution cost. Also has a simple …
Translation of C++ Core Guidelines [https://github.com/isocpp/CppCoreGuidelines] into Simplified Chinese.
Mirage Persistent Kernel: Compiling LLMs into a MegaKernel
Make a personal website using Notion and GitHub Pages
CUDA Templates and Python DSLs for High-Performance Linear Algebra
Paper collection on retrieval-based (augmented) language models.
Universal cross-platform tokenizer bindings to HF tokenizers and sentencepiece
Automatically Discovering Fast Parallelization Strategies for Distributed Deep Neural Network Training