jason-huang03
  • Tsinghua University
  • Beijing, China

Organizations

@thu-nics @thu-ml

37 starred repositories written in Cuda

LLM training in simple, raw C/CUDA

Cuda · 28,121 stars · 3,283 forks · Updated Jun 26, 2025

DeepEP: an efficient expert-parallel communication library

Cuda · 8,711 stars · 984 forks · Updated Nov 6, 2025

📚LeetCUDA: Modern CUDA learning notes with PyTorch for beginners🐑, covering 200+ CUDA kernels, Tensor Cores, HGEMM, and FA-2 MMA.🎉

Cuda · 8,383 stars · 829 forks · Updated Nov 6, 2025

DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling

Cuda · 5,871 stars · 739 forks · Updated Oct 15, 2025
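
To make "fine-grained scaling" concrete: the quantized tiles carry per-block scale factors, and the kernel folds `scale_a * scale_b` into an FP32 accumulator once per k-block rather than once per matrix. A scalar sketch of the idea, using float stand-ins for the FP8 payload (not DeepGEMM's actual code; all names are illustrative):

```cuda
// Per-block rescaled dot product: the low-precision products are summed
// within each k-block, then the block's partial sum is rescaled by that
// block's scales before joining the FP32 accumulator.
__host__ __device__ inline float scaled_dot(const float* aq, const float* bq,
                                            const float* scale_a,
                                            const float* scale_b,
                                            int K, int block_k) {
    float acc = 0.0f;
    for (int kb = 0; kb < K / block_k; ++kb) {
        float partial = 0.0f;
        for (int k = kb * block_k; k < (kb + 1) * block_k; ++k)
            partial += aq[k] * bq[k];               // low-precision products
        acc += partial * scale_a[kb] * scale_b[kb]; // per-block rescale
    }
    return acc;
}
```

Rescaling per block rather than per matrix is what keeps quantization error bounded at FP8 precision.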

FlashInfer: Kernel Library for LLM Serving

Cuda · 4,044 stars · 561 forks · Updated Nov 10, 2025

Tile primitives for speedy kernels

Cuda · 2,877 stars · 194 forks · Updated Nov 9, 2025

[ICLR2025, ICML2025, NeurIPS2025 Spotlight] Quantized attention achieving a 2-5x speedup over FlashAttention without losing end-to-end metrics across language, image, and video models.

Cuda · 2,650 stars · 259 forks · Updated Nov 6, 2025

How to optimize various algorithms in CUDA.

Cuda · 2,610 stars · 237 forks · Updated Nov 7, 2025

A series of GPU optimization topics introducing, in detail, how to optimize CUDA kernels, covering several basic optimizations including elementwise, reduce, s…

Cuda · 1,186 stars · 172 forks · Updated Jul 29, 2023
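
For a taste of the "reduce" topic mentioned above, here is a minimal shared-memory block reduction; series like this one typically start here and then refine with warp shuffles and vectorized loads (a sketch, not code from the repo):

```cuda
#include <cuda_runtime.h>

// Each block reduces BLOCK elements of `in` into one partial sum in `out`.
// Launch with blockDim.x == BLOCK and gridDim.x == ceil(n / BLOCK).
constexpr int BLOCK = 256;

__global__ void block_reduce_sum(const float* in, float* out, int n) {
    __shared__ float smem[BLOCK];
    int tid = threadIdx.x;
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    // Load one element per thread (0 if out of range).
    smem[tid] = (idx < n) ? in[idx] : 0.0f;
    __syncthreads();

    // Tree reduction in shared memory.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) smem[tid] += smem[tid + stride];
        __syncthreads();
    }

    // Thread 0 writes this block's partial sum.
    if (tid == 0) out[blockIdx.x] = smem[0];
}
```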

Flash Attention in ~100 lines of CUDA (forward pass only)

Cuda · 967 stars · 99 forks · Updated Dec 30, 2024
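
The heart of such a compact forward pass is the online softmax: a running max and running denominator let attention scores be normalized in a single streaming pass. A minimal sketch of the update rule (illustrative, not the repo's code):

```cuda
#include <cmath>

// Running state for the online (streaming) softmax used by flash attention:
// `m` is the max score seen so far, `l` is the sum of exp(score - m).
struct OnlineSoftmaxState {
    float m = -INFINITY;  // running max
    float l = 0.0f;       // running denominator
};

// Fold one new score into the state; earlier contributions are rescaled
// by exp(m_old - m_new) so the final normalization stays exact.
__host__ __device__ inline void online_softmax_update(OnlineSoftmaxState& s,
                                                      float score) {
    float m_new = fmaxf(s.m, score);
    s.l = s.l * expf(s.m - m_new) + expf(score - m_new);
    s.m = m_new;
}
```

The same rescaling is applied to the partial output accumulator, which is what lets the kernel avoid materializing the full attention matrix.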

Fast CUDA matrix multiplication from scratch

Cuda · 934 stars · 139 forks · Updated Sep 2, 2025
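
"From scratch" write-ups like this usually progress from a naive one-thread-per-element kernel to a shared-memory tiled one. A sketch of the tiled step, assuming M, N, and K are multiples of the tile size:

```cuda
#include <cuda_runtime.h>

// Shared-memory tiled SGEMM: C = A * B, with row-major MxK A and KxN B.
// Launch with dim3 block(TILE, TILE) and dim3 grid(N / TILE, M / TILE).
constexpr int TILE = 32;

__global__ void sgemm_tiled(const float* A, const float* B, float* C,
                            int M, int N, int K) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < K; t += TILE) {
        // Cooperatively stage one tile of A and one tile of B.
        As[threadIdx.y][threadIdx.x] = A[row * K + (t + threadIdx.x)];
        Bs[threadIdx.y][threadIdx.x] = B[(t + threadIdx.y) * N + col];
        __syncthreads();

        // Multiply-accumulate over the staged tiles.
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[row * N + col] = acc;
}
```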

Training materials associated with NVIDIA's CUDA Training Series (www.olcf.ornl.gov/cuda-training-series/)

Cuda · 897 stars · 325 forks · Updated Aug 19, 2024

Code from the "CUDA Crash Course" YouTube series by CoffeeBeforeArch

Cuda · 889 stars · 179 forks · Updated Jul 19, 2023
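
Intro series like this one conventionally begin with element-wise vector addition; a self-contained version for reference (not taken from the series):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// One thread per element: the canonical first CUDA kernel.
__global__ void vector_add(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *a, *b, *c;
    // Unified memory keeps the example short; real code often
    // uses cudaMalloc plus explicit copies.
    cudaMallocManaged(&a, bytes);
    cudaMallocManaged(&b, bytes);
    cudaMallocManaged(&c, bytes);
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    int block = 256;
    int grid = (n + block - 1) / block;
    vector_add<<<grid, block>>>(a, b, c, n);
    cudaDeviceSynchronize();

    printf("c[0] = %f\n", c[0]);  // expect 3.0
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```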

[ICML2025] SpargeAttention: A training-free sparse attention that accelerates any model inference.

Cuda · 764 stars · 65 forks · Updated Nov 10, 2025

CUDA Kernel Benchmarking Library

Cuda · 761 stars · 90 forks · Updated Oct 21, 2025

UNet diffusion model in pure CUDA

Cuda · 654 stars · 31 forks · Updated Jun 28, 2024

Causal depthwise conv1d in CUDA, with a PyTorch interface

Cuda · 640 stars · 135 forks · Updated Oct 20, 2025
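
A bare-bones CUDA sketch of a causal depthwise conv1d, to show the shape of the computation (the real repo adds a PyTorch binding, half precision, and heavy memory-access tuning; names here are illustrative):

```cuda
#include <cuda_runtime.h>

// Causal depthwise conv1d over a (channels x length) row-major buffer:
// each channel has its own kw-tap filter, and y[t] only sees
// x[t-kw+1 .. t], zero-padded on the left.
// Launch with dim3 grid((length + 255) / 256, channels) and 256 threads.
__global__ void causal_dwconv1d(const float* x, const float* w, float* y,
                                int channels, int length, int kw) {
    int c = blockIdx.y;                             // one channel per grid row
    int t = blockIdx.x * blockDim.x + threadIdx.x;  // time step
    if (c >= channels || t >= length) return;

    float acc = 0.0f;
    for (int k = 0; k < kw; ++k) {
        int src = t - (kw - 1) + k;                 // causal window
        if (src >= 0) acc += x[c * length + src] * w[c * kw + k];
    }
    y[c * length + t] = acc;
}
```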

Several optimization methods for half-precision general matrix multiplication (HGEMM) using Tensor Cores with the WMMA API and MMA PTX instructions.

Cuda · 493 stars · 86 forks · Updated Sep 8, 2024
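
The WMMA entry point looks like this: one warp owns a 16x16x16 tile and iterates over k. A minimal single-tile kernel using the public `nvcuda::wmma` API (a sketch, assuming K is a multiple of 16, row-major A, and column-major B):

```cuda
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp computes a single 16x16 tile of C = A * B on Tensor Cores.
// A is row-major 16xK half, B is col-major Kx16 half, C is 16x16 float.
// Compile for sm_70 or newer; launch with a single warp, e.g. <<<1, 32>>>.
__global__ void wmma_hgemm_16x16(const half* A, const half* B, float* C,
                                 int K) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc_frag;
    wmma::fill_fragment(acc_frag, 0.0f);

    for (int k = 0; k < K; k += 16) {
        // Load 16x16 sub-tiles straight from global memory (leading dim K).
        wmma::load_matrix_sync(a_frag, A + k, K);
        wmma::load_matrix_sync(b_frag, B + k, K);
        wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);
    }
    wmma::store_matrix_sync(C, acc_frag, 16, wmma::mem_row_major);
}
```

The MMA PTX path the repo also covers replaces these intrinsics with raw `mma.sync` instructions for finer control over register layout.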

Flash attention tutorial written in Python, Triton, CUDA, and CUTLASS.

Cuda · 445 stars · 47 forks · Updated May 14, 2025

Fastest kernels written from scratch

Cuda · 386 stars · 52 forks · Updated Sep 18, 2025

[MLSys'24] Atom: Low-bit Quantization for Efficient and Accurate LLM Serving

Cuda · 328 stars · 29 forks · Updated Jul 2, 2024

Efficient GPU support for LLM inference with x-bit quantization (e.g., FP6, FP5).

Cuda · 271 stars · 22 forks · Updated Jul 16, 2025

An implementation of the transformer architecture in NVIDIA CUDA kernels.

Cuda · 194 stars · 12 forks · Updated Sep 24, 2023

A lightweight design for computation-communication overlap.

Cuda · 183 stars · 8 forks · Updated Oct 10, 2025
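
The single-GPU version of computation-communication overlap, for orientation: issue the copy and the kernel on separate streams so the copy engine and the SMs run concurrently. The repo above targets the multi-GPU analogue; this standalone sketch uses only stock CUDA runtime calls:

```cuda
#include <cuda_runtime.h>

__global__ void dummy_compute(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;
}

int main() {
    const int n = 1 << 22;
    size_t bytes = n * sizeof(float);
    float *h_buf, *d_a, *d_b;
    cudaMallocHost(&h_buf, bytes);  // pinned memory enables async copies
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMemset(d_a, 0, bytes);      // give the kernel defined input

    cudaStream_t copy_stream, compute_stream;
    cudaStreamCreate(&copy_stream);
    cudaStreamCreate(&compute_stream);

    // The H2D transfer into d_b overlaps with the kernel working on d_a.
    cudaMemcpyAsync(d_b, h_buf, bytes, cudaMemcpyHostToDevice, copy_stream);
    dummy_compute<<<(n + 255) / 256, 256, 0, compute_stream>>>(d_a, n);

    cudaDeviceSynchronize();
    cudaStreamDestroy(copy_stream);
    cudaStreamDestroy(compute_stream);
    cudaFreeHost(h_buf);
    cudaFree(d_a);
    cudaFree(d_b);
    return 0;
}
```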

A quantization algorithm for LLMs.

Cuda · 146 stars · 8 forks · Updated Jun 21, 2024

NVSHMEM-Tutorial: Build a DeepEP-like GPU Buffer

Cuda · 142 stars · 11 forks · Updated Sep 18, 2025

⚡️Write HGEMM from scratch using Tensor Cores with the WMMA, MMA, and CuTe APIs, achieving peak performance.⚡️

Cuda · 124 stars · 7 forks · Updated May 10, 2025