Starred repositories
📚LeetCUDA: Modern CUDA Learn Notes with PyTorch for Beginners🐑, 200+ CUDA Kernels, Tensor Cores, HGEMM, FA-2 MMA.🎉
DeepEP: an efficient expert-parallel communication library
FlashInfer: Kernel Library for LLM Serving
How to optimize some algorithms in CUDA.
Flash Attention in ~100 lines of CUDA (forward pass only)
Several optimization methods for half-precision general matrix multiplication (HGEMM) using Tensor Cores, via the WMMA API and MMA PTX instructions.
Flash Attention tutorial written in Python, Triton, CUDA, and CUTLASS.
Efficient GPU support for LLM inference with x-bit quantization (e.g., FP6, FP5).
High-speed GEMV kernels, with up to 2.7x speedup over the PyTorch baseline.
gevtushenko / llm.c
Forked from karpathy/llm.c: LLM training in simple, raw C/CUDA.
TileFusion is an experimental C++ macro kernel template library that elevates the abstraction level in CUDA C for tile processing.
Flash Attention in raw CUDA C, beating PyTorch.
Kernels for attention and other diffusion-specific tasks.
This project optimizes multi-GPU parallelism for machine learning training, accelerating it with fused gradient buffers, NCCL AllReduce, and CUDA C kernel-level optimizations including me…
terrelln / dietgpu
Forked from facebookresearch/dietgpu: GPU implementation of a fast, generalized ANS (asymmetric numeral system) entropy encoder and decoder, with extensions for lossless compression of numerical and other data types in HPC/ML applications.
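Several of the starred repos above (the Flash Attention implementations and tutorials) build on the same core idea: an online softmax that streams over attention scores while tracking a running maximum and normalizer, so the full score row never has to be materialized. As a minimal illustration (a pure-Python sketch of the recurrence, not code from any of the listed repos):

```python
import math

def online_softmax(scores):
    """One-pass streaming softmax: keep a running max m and a running
    normalizer d, rescaling d whenever the max changes. This is the
    numerically stable recurrence that Flash Attention tiles over blocks."""
    m = -math.inf  # running maximum of scores seen so far
    d = 0.0        # running sum of exp(score - m)
    for x in scores:
        m_new = max(m, x)
        # Rescale the old normalizer to the new max, then add this term.
        d = d * math.exp(m - m_new) + math.exp(x - m_new)
        m = m_new
    return [math.exp(x - m) / d for x in scores]
```

In the actual CUDA kernels, the same rescaling is applied per tile of keys/values, which is what lets attention run in O(sequence length) memory instead of materializing the full score matrix.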