FP16xINT4 LLM inference kernel that achieves near-ideal ~4x speedups at medium batch sizes of 16-32 tokens.
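The ~4x speedup above comes from storing weights in 4-bit integers and dequantizing to FP16 on the fly, which cuts memory traffic roughly fourfold. A minimal pure-Python sketch of symmetric per-group INT4 quantization (hypothetical illustration only, not the actual kernel):

```python
# Hypothetical sketch of symmetric per-group INT4 weight quantization.
# Real FP16xINT4 kernels pack two 4-bit values per byte and dequantize
# inside the GEMM; here we just show the numerics.

def quantize_int4(weights, group_size=4):
    """Map each group of floats to integers in [-8, 7] with one FP scale per group."""
    qweights, scales = [], []
    for i in range(0, len(weights), group_size):
        group = weights[i:i + group_size]
        # Per-group scale so the largest magnitude maps near the INT4 limit 7.
        scale = max(abs(w) for w in group) / 7 or 1.0
        scales.append(scale)
        qweights.append([max(-8, min(7, round(w / scale))) for w in group])
    return qweights, scales

def dequantize_int4(qweights, scales):
    """Recover approximate FP values: q * scale, flattened back to one list."""
    return [q * s for qs, s in zip(qweights, scales) for q in qs]
```

The reconstruction error per weight is bounded by half the group scale, which is why quality degrades gracefully as long as groups are small.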
Accelerating MoE with IO and Tile-aware Optimizations
Helpful kernel tutorials, examples, and SKILLs for tile-based GPU programming
Distributed Compiler based on Triton for Parallel Systems
cuTile is a programming model for writing parallel kernels for NVIDIA GPUs
slime is an LLM post-training framework for RL scaling.
📚LeetCUDA: Modern CUDA Learn Notes with PyTorch for Beginners🐑, 200+ CUDA Kernels, Tensor Cores, HGEMM, FA-2 MMA.🎉
A Survey of Reinforcement Learning for Large Reasoning Models
Code repo for efficient quantized MoE inference with mixture of low-rank compensators
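A low-rank compensator corrects quantization error by adding a small low-rank term to the quantized weight matrix, W ≈ Q + u·vᵀ, applied without ever materializing the full matrix. A pure-Python rank-1 sketch under that assumption (the names `apply_compensated` etc. are hypothetical, not the repo's API):

```python
# Hypothetical rank-1 compensator sketch: y = (Q + u v^T) x computed as
# Q x + u (v . x), so the correction costs only two dot products extra.

def matvec(M, x):
    """Dense matrix-vector product over plain lists."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def apply_compensated(Q, u, v, x):
    """Apply (Q + u v^T) to x without forming the rank-1 update explicitly."""
    qx = matvec(Q, x)
    vx = sum(vi * xi for vi, xi in zip(v, x))  # scalar v . x
    return [q + ui * vx for q, ui in zip(qx, u)]
```

In practice u and v would be fit (e.g. via a truncated SVD) to the residual between the original and quantized expert weights; the point of the factored form is that the compensator adds O(r·(m+n)) work instead of O(m·n).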
C++ implementations of algorithms, plus solutions (both code and mathematical proofs) in LaTeX to exercises from “Introduction to Algorithms” (3rd ed., CLRS).