A series on GPU optimization topics that explains in detail how to optimize CUDA kernels. It covers several basic kernel optimizations, including elementwise, reduce, s…

Cuda 1,181 172 Updated Jul 29, 2023

rapidsai / raft

RAFT contains fundamental, widely used algorithms and primitives for machine learning and information retrieval. The algorithms are CUDA-accelerated and form building blocks for more easily writing …

Cuda 951 217 Updated Nov 5, 2025

siboehm / SGEMM_CUDA

Fast CUDA matrix multiplication from scratch

Cuda 928 137 Updated Sep 2, 2025

Dao-AILab / causal-conv1d

Causal depthwise conv1d in CUDA, with a PyTorch interface

Cuda 635 133 Updated Oct 20, 2025

NVIDIA / AMGX

Distributed multigrid linear solver library on GPU

Cuda 614 162 Updated Oct 15, 2025

Bruce-Lee-LY / cuda_hgemm

Several optimization methods for half-precision general matrix multiplication (HGEMM) using Tensor Cores with the WMMA API and MMA PTX instructions.

Cuda 490 86 Updated Sep 8, 2024

66RING / tiny-flash-attention

Flash attention tutorial written in Python, Triton, CUDA, and CUTLASS

Cuda 442 47 Updated May 14, 2025

nosferalatu / SimpleGPUHashTable

A simple GPU hash table implemented in CUDA using lock-free techniques

Cuda 400 44 Updated Feb 7, 2024

FZJ-JSC / tutorial-multi-gpu

Efficient Distributed GPU Programming for Exascale, an SC/ISC Tutorial

Cuda 315 66 Updated Oct 27, 2025

AlibabaResearch / flash-llm

Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity

Cuda 222 22 Updated Sep 24, 2023

anilshanbhag / gpu-topk

Efficient Top-K implementation on the GPU

Cuda 187 22 Updated Apr 9, 2019

NVIDIA-Merlin / HierarchicalKV

HierarchicalKV is a part of NVIDIA Merlin and provides hierarchical key-value storage to meet RecSys requirements. The key capability of HierarchicalKV is to store key-value feature-embeddings on h…

Cuda 175 30 Updated Nov 2, 2025