jason-huang03
  • Tsinghua University
  • Beijing, China

Organizations

@thu-nics @thu-ml

37 starred repositories written in Cuda

LLM training in simple, raw C/CUDA

Cuda · 28,121 stars · 3,283 forks · Updated Jun 26, 2025

DeepEP: an efficient expert-parallel communication library

Cuda · 8,711 stars · 984 forks · Updated Nov 6, 2025

📚LeetCUDA: Modern CUDA learning notes with PyTorch for beginners🐑, covering 200+ CUDA kernels, Tensor Cores, HGEMM, and FA-2 MMA.🎉

Cuda · 8,383 stars · 829 forks · Updated Nov 6, 2025

DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling

Cuda · 5,871 stars · 739 forks · Updated Oct 15, 2025
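
To make "fine-grained scaling" concrete: the quantized tiles carry per-block scale factors, and the kernel folds `scale_a * scale_b` into an FP32 accumulator once per k-block rather than once per matrix. A scalar sketch of the idea, using float stand-ins for the FP8 payload (not DeepGEMM's actual code; all names are illustrative):

```cuda
// Per-block rescaled dot product: the low-precision products are summed
// within each k-block, then the block's partial sum is rescaled by that
// block's scales before joining the FP32 accumulator.
__host__ __device__ inline float scaled_dot(const float* aq, const float* bq,
                                            const float* scale_a,
                                            const float* scale_b,
                                            int K, int block_k) {
    float acc = 0.0f;
    for (int kb = 0; kb < K / block_k; ++kb) {
        float partial = 0.0f;
        for (int k = kb * block_k; k < (kb + 1) * block_k; ++k)
            partial += aq[k] * bq[k];               // low-precision products
        acc += partial * scale_a[kb] * scale_b[kb]; // per-block rescale
    }
    return acc;
}
```

Rescaling per block rather than per matrix is what keeps quantization error bounded at FP8 precision.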

FlashInfer: Kernel Library for LLM Serving

Cuda · 4,044 stars · 561 forks · Updated Nov 10, 2025

Tile primitives for speedy kernels

Cuda · 2,877 stars · 194 forks · Updated Nov 9, 2025

[ICLR2025, ICML2025, NeurIPS2025 Spotlight] Quantized attention achieving a 2-5x speedup over FlashAttention without losing end-to-end metrics across language, image, and video models.

Cuda · 2,650 stars · 259 forks · Updated Nov 6, 2025

How to optimize various algorithms in CUDA.

Cuda · 2,610 stars · 237 forks · Updated Nov 7, 2025

A series of GPU optimization topics introducing, in detail, how to optimize CUDA kernels, covering several basic optimizations including elementwise, reduce, s…

Cuda · 1,186 stars · 172 forks · Updated Jul 29, 2023
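
For a taste of the "reduce" topic mentioned above, here is a minimal shared-memory block reduction; series like this one typically start here and then refine with warp shuffles and vectorized loads (a sketch, not code from the repo):

```cuda
#include <cuda_runtime.h>

// Each block reduces BLOCK elements of `in` into one partial sum in `out`.
// Launch with blockDim.x == BLOCK and gridDim.x == ceil(n / BLOCK).
constexpr int BLOCK = 256;

__global__ void block_reduce_sum(const float* in, float* out, int n) {
    __shared__ float smem[BLOCK];
    int tid = threadIdx.x;
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    // Load one element per thread (0 if out of range).
    smem[tid] = (idx < n) ? in[idx] : 0.0f;
    __syncthreads();

    // Tree reduction in shared memory.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) smem[tid] += smem[tid + stride];
        __syncthreads();
    }

    // Thread 0 writes this block's partial sum.
    if (tid == 0) out[blockIdx.x] = smem[0];
}
```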

Flash Attention in ~100 lines of CUDA (forward pass only)

Cuda · 967 stars · 99 forks · Updated Dec 30, 2024
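
The heart of such a compact forward pass is the online softmax: a running max and running denominator let attention scores be normalized in a single streaming pass. A minimal sketch of the update rule (illustrative, not the repo's code):

```cuda
#include <cmath>

// Running state for the online (streaming) softmax used by flash attention:
// `m` is the max score seen so far, `l` is the sum of exp(score - m).
struct OnlineSoftmaxState {
    float m = -INFINITY;  // running max
    float l = 0.0f;       // running denominator
};

// Fold one new score into the state; earlier contributions are rescaled
// by exp(m_old - m_new) so the final normalization stays exact.
__host__ __device__ inline void online_softmax_update(OnlineSoftmaxState& s,
                                                      float score) {
    float m_new = fmaxf(s.m, score);
    s.l = s.l * expf(s.m - m_new) + expf(score - m_new);
    s.m = m_new;
}
```

The same rescaling is applied to the partial output accumulator, which is what lets the kernel avoid materializing the full attention matrix.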

Fast CUDA matrix multiplication from scratch

Cuda · 934 stars · 139 forks · Updated Sep 2, 2025
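
"From scratch" write-ups like this usually progress from a naive one-thread-per-element kernel to a shared-memory tiled one. A sketch of the tiled step, assuming M, N, and K are multiples of the tile size:

```cuda
#include <cuda_runtime.h>

// Shared-memory tiled SGEMM: C = A * B, with row-major MxK A and KxN B.
// Launch with dim3 block(TILE, TILE) and dim3 grid(N / TILE, M / TILE).
constexpr int TILE = 32;

__global__ void sgemm_tiled(const float* A, const float* B, float* C,
                            int M, int N, int K) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < K; t += TILE) {
        // Cooperatively stage one tile of A and one tile of B.
        As[threadIdx.y][threadIdx.x] = A[row * K + (t + threadIdx.x)];
        Bs[threadIdx.y][threadIdx.x] = B[(t + threadIdx.y) * N + col];
        __syncthreads();

        // Multiply-accumulate over the staged tiles.
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[row * N + col] = acc;
}
```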

Training materials associated with NVIDIA's CUDA Training Series (www.olcf.ornl.gov/cuda-training-series/)

Cuda · 897 stars · 325 forks · Updated Aug 19, 2024

Code from the "CUDA Crash Course" YouTube series by CoffeeBeforeArch

Cuda · 889 stars · 179 forks · Updated Jul 19, 2023
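
Intro series like this one conventionally begin with element-wise vector addition; a self-contained version for reference (not taken from the series):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// One thread per element: the canonical first CUDA kernel.
__global__ void vector_add(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *a, *b, *c;
    // Unified memory keeps the example short; real code often
    // uses cudaMalloc plus explicit copies.
    cudaMallocManaged(&a, bytes);
    cudaMallocManaged(&b, bytes);
    cudaMallocManaged(&c, bytes);
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    int block = 256;
    int grid = (n + block - 1) / block;
    vector_add<<<grid, block>>>(a, b, c, n);
    cudaDeviceSynchronize();

    printf("c[0] = %f\n", c[0]);  // expect 3.0
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```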

[ICML2025] SpargeAttention: A training-free sparse attention that accelerates any model inference.

Cuda · 764 stars · 65 forks · Updated Nov 10, 2025

CUDA Kernel Benchmarking Library

Cuda · 761 stars · 90 forks · Updated Oct 21, 2025

UNet diffusion model in pure CUDA

Cuda · 654 stars · 31 forks · Updated Jun 28, 2024

Causal depthwise conv1d in CUDA, with a PyTorch interface

Cuda · 640 stars · 135 forks · Updated Oct 20, 2025
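
A bare-bones CUDA sketch of a causal depthwise conv1d, to show the shape of the computation (the real repo adds a PyTorch binding, half precision, and heavy memory-access tuning; names here are illustrative):

```cuda
#include <cuda_runtime.h>

// Causal depthwise conv1d over a (channels x length) row-major buffer:
// each channel has its own kw-tap filter, and y[t] only sees
// x[t-kw+1 .. t], zero-padded on the left.
// Launch with dim3 grid((length + 255) / 256, channels) and 256 threads.
__global__ void causal_dwconv1d(const float* x, const float* w, float* y,
                                int channels, int length, int kw) {
    int c = blockIdx.y;                             // one channel per grid row
    int t = blockIdx.x * blockDim.x + threadIdx.x;  // time step
    if (c >= channels || t >= length) return;

    float acc = 0.0f;
    for (int k = 0; k < kw; ++k) {
        int src = t - (kw - 1) + k;                 // causal window
        if (src >= 0) acc += x[c * length + src] * w[c * kw + k];
    }
    y[c * length + t] = acc;
}
```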

Several optimization methods for half-precision general matrix multiplication (HGEMM) using Tensor Cores with the WMMA API and MMA PTX instructions.

Cuda · 493 stars · 86 forks · Updated Sep 8, 2024
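
The WMMA entry point looks like this: one warp owns a 16x16x16 tile and iterates over k. A minimal single-tile kernel using the public `nvcuda::wmma` API (a sketch, assuming K is a multiple of 16, row-major A, and column-major B):

```cuda
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp computes a single 16x16 tile of C = A * B on Tensor Cores.
// A is row-major 16xK half, B is col-major Kx16 half, C is 16x16 float.
// Compile for sm_70 or newer; launch with a single warp, e.g. <<<1, 32>>>.
__global__ void wmma_hgemm_16x16(const half* A, const half* B, float* C,
                                 int K) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc_frag;
    wmma::fill_fragment(acc_frag, 0.0f);

    for (int k = 0; k < K; k += 16) {
        // Load 16x16 sub-tiles straight from global memory (leading dim K).
        wmma::load_matrix_sync(a_frag, A + k, K);
        wmma::load_matrix_sync(b_frag, B + k, K);
        wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);
    }
    wmma::store_matrix_sync(C, acc_frag, 16, wmma::mem_row_major);
}
```

The MMA PTX path the repo also covers replaces these intrinsics with raw `mma.sync` instructions for finer control over register layout.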

Flash attention tutorial written in Python, Triton, CUDA, and CUTLASS.

Cuda · 445 stars · 47 forks · Updated May 14, 2025

Fastest kernels written from scratch

Cuda · 386 stars · 52 forks · Updated Sep 18, 2025

[MLSys'24] Atom: Low-bit Quantization for Efficient and Accurate LLM Serving

Cuda · 328 stars · 29 forks · Updated Jul 2, 2024

Efficient GPU support for LLM inference with x-bit quantization (e.g., FP6, FP5).

Cuda · 271 stars · 22 forks · Updated Jul 16, 2025

An implementation of the transformer architecture in NVIDIA CUDA kernels.

Cuda · 194 stars · 12 forks · Updated Sep 24, 2023

A lightweight design for computation-communication overlap.

Cuda · 183 stars · 8 forks · Updated Oct 10, 2025
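
The single-GPU version of computation-communication overlap, for orientation: issue the copy and the kernel on separate streams so the copy engine and the SMs run concurrently. The repo above targets the multi-GPU analogue; this standalone sketch uses only stock CUDA runtime calls:

```cuda
#include <cuda_runtime.h>

__global__ void dummy_compute(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;
}

int main() {
    const int n = 1 << 22;
    size_t bytes = n * sizeof(float);
    float *h_buf, *d_a, *d_b;
    cudaMallocHost(&h_buf, bytes);  // pinned memory enables async copies
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMemset(d_a, 0, bytes);      // give the kernel defined input

    cudaStream_t copy_stream, compute_stream;
    cudaStreamCreate(&copy_stream);
    cudaStreamCreate(&compute_stream);

    // The H2D transfer into d_b overlaps with the kernel working on d_a.
    cudaMemcpyAsync(d_b, h_buf, bytes, cudaMemcpyHostToDevice, copy_stream);
    dummy_compute<<<(n + 255) / 256, 256, 0, compute_stream>>>(d_a, n);

    cudaDeviceSynchronize();
    cudaStreamDestroy(copy_stream);
    cudaStreamDestroy(compute_stream);
    cudaFreeHost(h_buf);
    cudaFree(d_a);
    cudaFree(d_b);
    return 0;
}
```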

A quantization algorithm for LLMs.

Cuda · 146 stars · 8 forks · Updated Jun 21, 2024

NVSHMEM-Tutorial: Build a DeepEP-like GPU Buffer

Cuda · 142 stars · 11 forks · Updated Sep 18, 2025

⚡️Write HGEMM from scratch using Tensor Cores with the WMMA, MMA, and CuTe APIs, achieving peak performance.⚡️

Cuda · 124 stars · 7 forks · Updated May 10, 2025