Starred repositories
📚LeetCUDA: Modern CUDA Learn Notes with PyTorch for Beginners🐑, 200+ CUDA Kernels, Tensor Cores, HGEMM, FA-2 MMA.🎉
FlashInfer: Kernel Library for LLM Serving
How to optimize some algorithms in CUDA.
[ARCHIVED] Cooperative primitives for CUDA C++. See https://github.com/NVIDIA/cccl
Learn CUDA Programming, published by Packt
This is a series of GPU optimization topics. Here we will introduce how to optimize the CUDA kernel in detail. I will introduce several basic kernel optimizations, including: elementwise, reduce, s…
RAFT contains fundamental widely-used algorithms and primitives for machine learning and information retrieval. The algorithms are CUDA-accelerated and form building blocks for more easily writing …
Causal depthwise conv1d in CUDA, with a PyTorch interface
Several optimization methods for half-precision general matrix multiplication (HGEMM) using Tensor Cores with the WMMA API and MMA PTX instructions.
Flash Attention tutorial written in Python, Triton, CUDA, and CUTLASS
A simple GPU hash table implemented in CUDA using lock-free techniques
Efficient Distributed GPU Programming for Exascale, an SC/ISC Tutorial
Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity
HierarchicalKV is a part of NVIDIA Merlin and provides hierarchical key-value storage to meet RecSys requirements. The key capability of HierarchicalKV is to store key-value feature-embeddings on h…
Benchmark code for the "Online normalizer calculation for softmax" paper
PyTorch half-precision GEMM library with fused optional bias and optional ReLU/GELU
Several optimization methods for half-precision general matrix-vector multiplication (HGEMV) using CUDA cores.
CUDA implementation of parallel radix sort using Blelloch scan
High Performance Grouped GEMM in PyTorch
A CUDA kernel for NHWC GroupNorm for PyTorch
A fast, yet specialized, RMSNorm/LayerNorm implementation