jason-huang03
  • Tsinghua University
  • Beijing, China

Organizations

@thu-nics @thu-ml

37 starred repositories written in CUDA

LLM training in simple, raw C/CUDA

Cuda · 28,097 stars · 3,267 forks · Updated Jun 26, 2025

DeepEP: an efficient expert-parallel communication library

Cuda · 8,697 stars · 976 forks · Updated Nov 6, 2025

📚 LeetCUDA: modern CUDA learning notes with PyTorch for beginners 🐑; 200+ CUDA kernels, Tensor Cores, HGEMM, FA-2 MMA. 🎉

Cuda · 8,348 stars · 827 forks · Updated Nov 6, 2025

DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling

Cuda · 5,865 stars · 738 forks · Updated Oct 15, 2025

FlashInfer: Kernel Library for LLM Serving

Cuda · 4,025 stars · 560 forks · Updated Nov 7, 2025

Tile primitives for speedy kernels

Cuda · 2,871 stars · 192 forks · Updated Nov 7, 2025

[ICLR2025, ICML2025, NeurIPS2025 Spotlight] Quantized Attention achieves a 2-5x speedup over FlashAttention without degrading end-to-end metrics across language, image, and video models.

Cuda · 2,633 stars · 259 forks · Updated Nov 6, 2025

How to optimize some algorithms in CUDA.

Cuda · 2,603 stars · 237 forks · Updated Nov 7, 2025

A series of GPU optimization topics introducing in detail how to optimize CUDA kernels, covering several basic kernel optimizations, including: elementwise, reduce, s… (A sketch of the reduce pattern follows below.)

Cuda · 1,184 stars · 172 forks · Updated Jul 29, 2023
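
The reduce pattern this entry mentions is the classic first optimization target. Below is a minimal sketch of a shared-memory block reduction that combines per-block partial sums with atomicAdd; names and sizes are my own illustration, not code from the repo.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Minimal block-level sum reduction (illustrative sketch).
// Each block reduces its tile in shared memory via a tree reduction;
// partial sums are combined with one atomicAdd per block.
__global__ void reduce_sum(const float* in, float* out, int n) {
    extern __shared__ float smem[];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;

    smem[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    // Tree reduction within the block (blockDim.x must be a power of 2).
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) smem[tid] += smem[tid + stride];
        __syncthreads();
    }
    if (tid == 0) atomicAdd(out, smem[0]);
}

int main() {
    const int n = 1 << 20;
    std::vector<float> h(n, 1.0f);  // all ones, so the expected sum is n
    float *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, sizeof(float));
    cudaMemcpy(d_in, h.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(d_out, 0, sizeof(float));

    const int block = 256;
    const int grid = (n + block - 1) / block;
    reduce_sum<<<grid, block, block * sizeof(float)>>>(d_in, d_out, n);

    float result;
    cudaMemcpy(&result, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    printf("sum = %.0f (expected %d)\n", result, n);
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```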

Flash Attention in ~100 lines of CUDA (forward pass only)

Cuda · 961 stars · 97 forks · Updated Dec 30, 2024
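
The forward pass in such minimal implementations hinges on the online softmax recurrence: keep a running maximum m and running denominator l, rescaling l whenever m grows, so attention can be computed tile by tile without materializing the full score matrix. A single-thread sketch of just that recurrence (illustrative, not the repo's kernel):

```cuda
#include <cmath>
#include <cstdio>

// Online softmax recurrence used by flash-attention-style kernels:
// stream over scores while maintaining a running max m and running
// denominator l, rescaling l when the max increases. Real kernels
// apply the same rescaling to the output accumulator as key/value
// tiles stream through shared memory.
void online_softmax(const float* scores, float* probs, int n) {
    float m = -INFINITY;  // running max
    float l = 0.0f;       // running denominator
    for (int i = 0; i < n; ++i) {
        float m_new = fmaxf(m, scores[i]);
        l = l * expf(m - m_new) + expf(scores[i] - m_new);
        m = m_new;
    }
    for (int i = 0; i < n; ++i)
        probs[i] = expf(scores[i] - m) / l;
}

int main() {
    float s[4] = {1.0f, 2.0f, 3.0f, 4.0f}, p[4];
    online_softmax(s, p, 4);
    for (float v : p) printf("%.4f ", v);  // matches a two-pass softmax
    printf("\n");
    return 0;
}
```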

Fast CUDA matrix multiplication from scratch

Cuda · 931 stars · 138 forks · Updated Sep 2, 2025
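
The usual first step past a naive kernel in a "from scratch" matmul progression is shared-memory tiling. A minimal sketch of that step, assuming for brevity that the matrix dimensions are multiples of the tile width (my own example, not the repo's code):

```cuda
#include <cuda_runtime.h>

#define TILE 32

// Shared-memory tiled SGEMM sketch: C = A * B, row-major, assuming
// M, N, K are multiples of TILE. Each block computes a TILE x TILE
// tile of C, staging tiles of A and B through shared memory so each
// global load is reused TILE times.
__global__ void sgemm_tiled(const float* A, const float* B, float* C,
                            int M, int N, int K) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < K / TILE; ++t) {
        As[threadIdx.y][threadIdx.x] = A[row * K + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[row * N + col] = acc;
}

// Launch: dim3 block(TILE, TILE); dim3 grid(N / TILE, M / TILE);
// sgemm_tiled<<<grid, block>>>(dA, dB, dC, M, N, K);
```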

Training materials associated with NVIDIA's CUDA Training Series (www.olcf.ornl.gov/cuda-training-series/)

Cuda · 890 stars · 323 forks · Updated Aug 19, 2024

Code from the "CUDA Crash Course" YouTube series by CoffeeBeforeArch

Cuda · 886 stars · 179 forks · Updated Jul 19, 2023

CUDA Kernel Benchmarking Library

Cuda · 761 stars · 90 forks · Updated Oct 21, 2025
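
For context, the basic measurement such a library automates looks like this hand-rolled CUDA-event sketch (my own example, not the library's API): warm up, then average over many launches.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void busy_kernel(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] * 2.0f + 1.0f;
}

// Hand-rolled CUDA-event timing: the warmup + averaged measurement
// that a benchmarking library wraps in a tidy harness.
int main() {
    const int n = 1 << 24;
    float* d;
    cudaMalloc(&d, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    busy_kernel<<<(n + 255) / 256, 256>>>(d, n);  // warmup launch
    cudaDeviceSynchronize();

    const int iters = 100;
    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i)
        busy_kernel<<<(n + 255) / 256, 256>>>(d, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("avg kernel time: %.3f ms\n", ms / iters);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d);
    return 0;
}
```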

[ICML2025] SpargeAttention: a training-free sparse attention that accelerates inference for any model.

Cuda · 759 stars · 65 forks · Updated Nov 7, 2025

UNet diffusion model in pure CUDA

Cuda · 653 stars · 31 forks · Updated Jun 28, 2024

Causal depthwise conv1d in CUDA, with a PyTorch interface

Cuda · 638 stars · 133 forks · Updated Oct 20, 2025
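
A minimal sketch of the operation itself (illustrative only; the repo adds a PyTorch interface and much faster kernels): each channel is convolved with its own small filter, and causality means position t reads only inputs at t-K+1..t.

```cuda
#include <cuda_runtime.h>

// Causal depthwise conv1d sketch: input x is (C, L) row-major and each
// channel c has its own K-tap filter w[c]. The output at position t uses
// only x[c][t-K+1 .. t] (zero-padded on the left), so no future leaks in.
__global__ void causal_dwconv1d(const float* x, const float* w, float* y,
                                int C, int L, int K) {
    int c = blockIdx.y;
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (c >= C || t >= L) return;

    float acc = 0.0f;
    for (int k = 0; k < K; ++k) {
        int src = t - (K - 1) + k;  // look back up to K-1 positions
        if (src >= 0)
            acc += w[c * K + k] * x[c * L + src];
    }
    y[c * L + t] = acc;
}

// Launch: dim3 block(256); dim3 grid((L + 255) / 256, C);
// causal_dwconv1d<<<grid, block>>>(dx, dw, dy, C, L, K);
```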

Several optimization methods for half-precision general matrix multiplication (HGEMM) using Tensor Cores, via the WMMA API and MMA PTX instructions (see the WMMA sketch below).

Cuda · 491 stars · 86 forks · Updated Sep 8, 2024
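
The core WMMA pattern these optimizations build on fits in a few lines: one warp loads 16x16 fragments, issues mma_sync on Tensor Cores, and stores the accumulator. A minimal sketch (requires sm_70+; not the repo's code):

```cuda
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp computes a single 16x16x16 HGEMM tile with the WMMA API:
// load fragments, issue mma_sync, store the float accumulator.
__global__ void wmma_16x16x16(const half* A, const half* B, float* C) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;

    wmma::fill_fragment(acc, 0.0f);      // C tile starts at zero
    wmma::load_matrix_sync(a, A, 16);    // leading dimension 16
    wmma::load_matrix_sync(b, B, 16);
    wmma::mma_sync(acc, a, b, acc);      // acc += A * B on Tensor Cores
    wmma::store_matrix_sync(C, acc, 16, wmma::mem_row_major);
}

// Launch with a single warp: wmma_16x16x16<<<1, 32>>>(dA, dB, dC);
// Full HGEMM kernels tile this over the whole matrix and pipeline
// the fragment loads through shared memory.
```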

Flash attention tutorial written in Python, Triton, CUDA, and CUTLASS.

Cuda · 442 stars · 47 forks · Updated May 14, 2025

Fastest kernels written from scratch

Cuda · 386 stars · 52 forks · Updated Sep 18, 2025

[MLSys'24] Atom: Low-bit Quantization for Efficient and Accurate LLM Serving

Cuda · 327 stars · 29 forks · Updated Jul 2, 2024

Efficient GPU support for LLM inference with x-bit quantization (e.g. FP6, FP5).

Cuda · 271 stars · 22 forks · Updated Jul 16, 2025

An implementation of the transformer architecture as NVIDIA CUDA kernels.

Cuda · 192 stars · 12 forks · Updated Sep 24, 2023

A lightweight design for computation-communication overlap.

Cuda · 183 stars · 8 forks · Updated Oct 10, 2025
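
The textbook single-GPU form of computation-communication overlap is chunking work across streams so one chunk's copy overlaps another chunk's kernel; the repo above pursues finer-grained overlap, but this sketch (my own example) shows the principle:

```cuda
#include <cuda_runtime.h>

__global__ void compute(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;
}

// Minimal computation-communication overlap: split the buffer into
// chunks and alternate between two streams so chunk k's transfers
// overlap chunk k-1's kernel.
int main() {
    const int n = 1 << 24, chunks = 4, cn = n / chunks;
    float *h, *d;
    cudaMallocHost(&h, n * sizeof(float));  // pinned host memory is
    cudaMalloc(&d, n * sizeof(float));      // required for async copies

    cudaStream_t s[2];
    cudaStreamCreate(&s[0]);
    cudaStreamCreate(&s[1]);

    for (int c = 0; c < chunks; ++c) {
        cudaStream_t st = s[c % 2];
        size_t off = (size_t)c * cn;
        cudaMemcpyAsync(d + off, h + off, cn * sizeof(float),
                        cudaMemcpyHostToDevice, st);
        compute<<<(cn + 255) / 256, 256, 0, st>>>(d + off, cn);
        cudaMemcpyAsync(h + off, d + off, cn * sizeof(float),
                        cudaMemcpyDeviceToHost, st);
    }
    cudaDeviceSynchronize();

    cudaStreamDestroy(s[0]);
    cudaStreamDestroy(s[1]);
    cudaFreeHost(h);
    cudaFree(d);
    return 0;
}
```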

A quantization algorithm for LLMs.

Cuda · 145 stars · 8 forks · Updated Jun 21, 2024

NVSHMEM-Tutorial: Build a DeepEP-like GPU Buffer

Cuda · 142 stars · 11 forks · Updated Sep 18, 2025

⚡️ Write HGEMM from scratch using Tensor Cores with the WMMA, MMA, and CuTe APIs, and achieve peak performance. ⚡️

Cuda · 124 stars · 7 forks · Updated May 10, 2025