🏠
Working from home
  • rednote-hilab
  • Beijing

Starred repositories

30 stars written in Cuda

📚LeetCUDA: Modern CUDA Learn Notes with PyTorch for Beginners🐑, 200+ CUDA Kernels, Tensor Cores, HGEMM, FA-2 MMA.🎉

Cuda 8,319 823 Updated Oct 17, 2025

FlashInfer: Kernel Library for LLM Serving

Cuda 4,018 558 Updated Nov 5, 2025

Tile primitives for speedy kernels

Cuda 2,865 191 Updated Nov 4, 2025

How to optimize some algorithms in CUDA.

Cuda 2,596 235 Updated Oct 30, 2025

[ARCHIVED] Cooperative primitives for CUDA C++. See https://github.com/NVIDIA/cccl

Cuda 1,801 463 Updated Oct 9, 2023

NCCL Tests

Cuda 1,325 326 Updated Nov 3, 2025

Learn CUDA Programming, published by Packt

Cuda 1,210 262 Updated Dec 30, 2023

This is a series of GPU optimization topics. Here we will introduce how to optimize the CUDA kernel in detail. I will introduce several basic kernel optimizations, including: elementwise, reduce, s…

Cuda 1,181 172 Updated Jul 29, 2023

RAFT contains fundamental widely-used algorithms and primitives for machine learning and information retrieval. The algorithms are CUDA-accelerated and form building blocks for more easily writing …

Cuda 951 217 Updated Nov 5, 2025

Fast CUDA matrix multiplication from scratch

Cuda 928 137 Updated Sep 2, 2025

Causal depthwise conv1d in CUDA, with a PyTorch interface

Cuda 635 133 Updated Oct 20, 2025

Distributed multigrid linear solver library on GPU

Cuda 614 162 Updated Oct 15, 2025

Several optimization methods of half-precision general matrix multiplication (HGEMM) using tensor core with WMMA API and MMA PTX instruction.

Cuda 490 86 Updated Sep 8, 2024

Flash attention tutorial written in Python, Triton, CUDA, and CUTLASS

Cuda 442 47 Updated May 14, 2025

A simple GPU hash table implemented in CUDA using lock free techniques

Cuda 400 44 Updated Feb 7, 2024

Efficient Distributed GPU Programming for Exascale, an SC/ISC Tutorial

Cuda 315 66 Updated Oct 27, 2025

Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity

Cuda 222 22 Updated Sep 24, 2023

Efficient Top-K implementation on the GPU

Cuda 187 22 Updated Apr 9, 2019

HierarchicalKV is a part of NVIDIA Merlin and provides hierarchical key-value storage to meet RecSys requirements. The key capability of HierarchicalKV is to store key-value feature-embeddings on h…

Cuda 175 30 Updated Nov 2, 2025

PyTorch bindings for CUTLASS grouped GEMM.

Cuda 126 77 Updated May 29, 2025

Benchmark code for the "Online normalizer calculation for softmax" paper

Cuda 102 10 Updated Jul 27, 2018

Play with GEMM using TVM

Cuda 92 11 Updated Jul 22, 2023

PyTorch half precision gemm lib w/ fused optional bias + optional relu/gelu

Cuda 76 5 Updated Dec 3, 2024

Several optimization methods of half-precision general matrix vector multiplication (HGEMV) using CUDA core.

Cuda 68 7 Updated Sep 8, 2024

CUDA implementation of parallel radix sort using Blelloch scan

Cuda 65 15 Updated Feb 29, 2024

High Performance Grouped GEMM in PyTorch

Cuda 31 2 Updated May 10, 2022

A way to use CUDA to accelerate the top-k algorithm

Cuda 30 7 Updated Jul 11, 2017

A CUDA kernel for NHWC GroupNorm for PyTorch

Cuda 21 3 Updated Nov 15, 2024

GPU TopK Benchmark

Cuda 16 3 Updated Dec 19, 2024

A fast, yet specialized, RMSNorm/LayerNorm implementation

Cuda 4 Updated May 25, 2024