yiliu30's starred repositories: 23 stars written in Cuda

LLM training in simple, raw C/CUDA

Cuda 28,459 3,338 Updated Jun 26, 2025

📚LeetCUDA: Modern CUDA Learn Notes with PyTorch for Beginners🐑, 200+ CUDA Kernels, Tensor Cores, HGEMM, FA-2 MMA.🎉

Cuda 9,038 888 Updated Dec 24, 2025

DeepEP: an efficient expert-parallel communication library

Cuda 8,829 1,036 Updated Dec 24, 2025

DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling

Cuda 5,996 778 Updated Dec 23, 2025
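A minimal sketch of the fine-grained scaling named in that description, assuming int8 storage in place of FP8 and an invented scale layout (one float per 128-wide K-block, per row of A and per column of B); this illustrates the idea, not DeepGEMM's actual API or format:

```cuda
// Illustrative sketch of fine-grained block scaling in a quantized GEMM.
// int8 stands in for FP8, and the scale layout is an assumption for the
// sketch, not DeepGEMM's actual format.
#include <cuda_runtime.h>

constexpr int K_BLOCK = 128;  // granularity of the scale factors along K

// C (MxN) = dequant(Aq, sA) * dequant(Bq, sB); K assumed divisible by 128.
// Launch with a 2-D grid covering MxN, one thread per output element.
__global__ void scaled_gemm_naive(const signed char* Aq, const float* sA,
                                  const signed char* Bq, const float* sB,
                                  float* C, int M, int N, int K) {
    int m = blockIdx.y * blockDim.y + threadIdx.y;
    int n = blockIdx.x * blockDim.x + threadIdx.x;
    if (m >= M || n >= N) return;
    int kblocks = K / K_BLOCK;
    float acc = 0.f;
    for (int kb = 0; kb < kblocks; ++kb) {
        // Integer partial dot product over one 128-wide block of K...
        int iacc = 0;
        for (int k = kb * K_BLOCK; k < (kb + 1) * K_BLOCK; ++k)
            iacc += int(Aq[m * K + k]) * int(Bq[k * N + n]);
        // ...rescaled with this block's own factors, so an outlier in one
        // block does not force a coarse scale onto the whole tensor.
        acc += float(iacc) * sA[m * kblocks + kb] * sB[kb * N + n];
    }
    C[m * N + n] = acc;
}
```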

FlashInfer: Kernel Library for LLM Serving

Cuda 4,347 614 Updated Dec 24, 2025

Tile primitives for speedy kernels

Cuda 3,013 220 Updated Dec 9, 2025

[ICLR2025, ICML2025, NeurIPS2025 Spotlight] Quantized Attention achieves speedup of 2-5x compared to FlashAttention, without losing end-to-end metrics across language, image, and video models.

Cuda 2,914 291 Updated Dec 22, 2025

How to optimize some algorithms in CUDA.

Cuda 2,711 244 Updated Dec 23, 2025

Learn CUDA Programming, published by Packt

Cuda 1,218 262 Updated Dec 30, 2023

Flash Attention in ~100 lines of CUDA (forward pass only)

Cuda 1,031 102 Updated Dec 30, 2024
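The forward pass that repo implements rests on a streaming-softmax recurrence; here is a hedged serial sketch of it, with assumed names and dimensions (one thread per query row, head dimension capped at 64):

```cuda
// Serial sketch of the streaming-softmax recurrence behind a FlashAttention
// forward pass: one thread owns one query row and walks the keys once,
// keeping a running max m, normalizer l, and output accumulator acc.
// Assumes head dimension d <= 64; real kernels tile K/V via shared memory.
#include <cuda_runtime.h>
#include <math.h>

__global__ void attention_forward_rowwise(const float* Q, const float* K,
                                          const float* V, float* O,
                                          int N, int d) {
    int q = blockIdx.x * blockDim.x + threadIdx.x;  // query row index
    if (q >= N || d > 64) return;
    float m = -INFINITY, l = 0.f;
    float acc[64];
    for (int j = 0; j < d; ++j) acc[j] = 0.f;
    for (int k = 0; k < N; ++k) {
        float s = 0.f;                              // s = (q . k_k) / sqrt(d)
        for (int j = 0; j < d; ++j) s += Q[q * d + j] * K[k * d + j];
        s *= rsqrtf((float)d);
        float m_new = fmaxf(m, s);
        float corr = expf(m - m_new);               // rescale old partial sums
        float p = expf(s - m_new);
        l = l * corr + p;
        for (int j = 0; j < d; ++j)
            acc[j] = acc[j] * corr + p * V[k * d + j];
        m = m_new;
    }
    for (int j = 0; j < d; ++j) O[q * d + j] = acc[j] / l;  // final normalize
}
```

The correction factor expf(m - m_new) rescales the accumulated output whenever a new running maximum appears, which is what lets the kernel avoid materializing the full N x N score matrix.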

Training materials associated with NVIDIA's CUDA Training Series (www.olcf.ornl.gov/cuda-training-series/)

Cuda 925 339 Updated Aug 19, 2024

Fastest kernels written from scratch

Cuda 500 61 Updated Sep 18, 2025

A simple high-performance CUDA GEMM implementation.

Cuda 421 42 Updated Jan 4, 2024

Step-by-step optimization of CUDA SGEMM

Cuda 416 54 Updated Mar 30, 2022
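Step-by-step SGEMM guides like the two GEMM repos above typically move from a naive one-thread-per-element kernel to shared-memory tiling; a minimal sketch of that tiled step, assuming dimensions divisible by the tile size:

```cuda
// Shared-memory tiled SGEMM, the classic first optimization step after the
// naive one-thread-per-element kernel. M, N, K are assumed divisible by TILE.
// Launch: dim3 block(TILE, TILE), grid(N / TILE, M / TILE).
#include <cuda_runtime.h>

#define TILE 32

__global__ void sgemm_tiled(const float* A, const float* B, float* C,
                            int M, int N, int K) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.f;
    for (int t = 0; t < K; t += TILE) {
        // Stage one TILE x TILE block of A and B in shared memory...
        As[threadIdx.y][threadIdx.x] = A[row * K + t + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t + threadIdx.y) * N + col];
        __syncthreads();
        // ...then each thread reuses it TILE times instead of re-reading DRAM.
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[row * N + col] = acc;  // C = A (MxK) * B (KxN), row-major
}
```

Each staged element is loaded from global memory once per tile but read TILE times from shared memory, which is the arithmetic-intensity argument such tutorials develop step by step.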

[MLSys'24] Atom: Low-bit Quantization for Efficient and Accurate LLM Serving

Cuda 331 30 Updated Jul 2, 2024

Efficient GPU support for LLM inference with x-bit quantization (e.g., FP6, FP5).

Cuda 277 22 Updated Jul 16, 2025

A set of hands-on tutorials for CUDA programming

Cuda 243 35 Updated Apr 8, 2024

(no description)

Cuda 214 172 Updated Aug 27, 2019

Distributed MoE in a Single Kernel [NeurIPS '25]

Cuda 157 18 Updated Dec 23, 2025

Benchmark code for the "Online normalizer calculation for softmax" paper

Cuda 103 10 Updated Jul 27, 2018
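For reference, the recurrence that paper (and benchmark) is about, as a minimal serial sketch with illustrative names:

```cuda
// The paper's recurrence: the running maximum and the running normalizer are
// updated together, so softmax needs two passes over x instead of three.
// Serial host-side sketch; the repo benchmarks parallel CUDA variants.
#include <cstdio>
#include <math.h>

void online_softmax(const float* x, float* y, int n) {
    float m = -INFINITY, d = 0.f;
    for (int i = 0; i < n; ++i) {   // pass 1: fused max + normalizer
        float m_new = fmaxf(m, x[i]);
        d = d * expf(m - m_new) + expf(x[i] - m_new);
        m = m_new;
    }
    for (int i = 0; i < n; ++i)     // pass 2: normalize
        y[i] = expf(x[i] - m) / d;
}

int main() {
    float x[4] = {1.f, 2.f, 3.f, 4.f}, y[4];
    online_softmax(x, y, 4);
    for (int i = 0; i < 4; ++i) printf("%.4f\n", y[i]);
    return 0;
}
```

This is the same max-rescaling trick that the FlashAttention sketch above applies to its output accumulator; here it only fuses the max pass with the normalizer pass.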