withlin · Guangzhou, China

Starred repositories

12 stars written in Cuda
📚LeetCUDA: Modern CUDA Learn Notes with PyTorch for Beginners🐑, 200+ CUDA Kernels, Tensor Cores, HGEMM, FA-2 MMA.🎉

Cuda 10,259 1,041 Updated Apr 12, 2026

DeepEP: an efficient expert-parallel communication library

Cuda 9,117 1,148 Updated Apr 9, 2026

How to optimize some algorithms in CUDA.

Cuda 2,915 267 Updated Apr 9, 2026

NCCL Tests

Cuda 1,485 363 Updated Mar 11, 2026

A series of GPU optimization topics that explains in detail how to optimize CUDA kernels. It covers several basic kernel optimizations, including: elementwise, reduce, s…

Cuda 1,277 180 Updated Jul 29, 2023

Flash Attention in ~100 lines of CUDA (forward pass only)

Cuda 1,118 111 Updated Dec 30, 2024

Efficient GPU kernels for block-sparse matrix multiplication and convolution

Cuda 1,065 198 Updated Jun 8, 2023

Examples demonstrating available options to program multiple GPUs in a single node or a cluster

Cuda 883 149 Updated Sep 26, 2025

HPC tutorial covering collective communication (MPI, NCCL), CUDA programming, SIMD vectorization, RDMA communication, and more

Cuda 406 44 Updated Apr 7, 2026

mHC kernels implemented in CUDA

Cuda 259 20 Updated Mar 9, 2026

Distributed MoE in a Single Kernel [NeurIPS '25]

Cuda 245 33 Updated Apr 6, 2026

NVSHMEM‑Tutorial: Build a DeepEP‑like GPU Buffer

Cuda 177 14 Updated Feb 11, 2026