yiliu30's starred repositories: 23 stars written in Cuda

LLM training in simple, raw C/CUDA

Cuda 28,459 3,338 Updated Jun 26, 2025

📚LeetCUDA: Modern CUDA Learn Notes with PyTorch for Beginners🐑, 200+ CUDA Kernels, Tensor Cores, HGEMM, FA-2 MMA.🎉

Cuda 9,038 888 Updated Dec 24, 2025

DeepEP: an efficient expert-parallel communication library

Cuda 8,829 1,036 Updated Dec 24, 2025

DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling

Cuda 5,996 778 Updated Dec 23, 2025
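A minimal sketch of the fine-grained scaling named in that description, assuming int8 storage in place of FP8 and an invented scale layout (one float per 128-wide K-block, per row of A and per column of B); this illustrates the idea, not DeepGEMM's actual API or format:

```cuda
// Illustrative sketch of fine-grained block scaling in a quantized GEMM.
// int8 stands in for FP8, and the scale layout is an assumption for the
// sketch, not DeepGEMM's actual format.
#include <cuda_runtime.h>

constexpr int K_BLOCK = 128;  // granularity of the scale factors along K

// C (MxN) = dequant(Aq, sA) * dequant(Bq, sB); K assumed divisible by 128.
// Launch with a 2-D grid covering MxN, one thread per output element.
__global__ void scaled_gemm_naive(const signed char* Aq, const float* sA,
                                  const signed char* Bq, const float* sB,
                                  float* C, int M, int N, int K) {
    int m = blockIdx.y * blockDim.y + threadIdx.y;
    int n = blockIdx.x * blockDim.x + threadIdx.x;
    if (m >= M || n >= N) return;
    int kblocks = K / K_BLOCK;
    float acc = 0.f;
    for (int kb = 0; kb < kblocks; ++kb) {
        // Integer partial dot product over one 128-wide block of K...
        int iacc = 0;
        for (int k = kb * K_BLOCK; k < (kb + 1) * K_BLOCK; ++k)
            iacc += int(Aq[m * K + k]) * int(Bq[k * N + n]);
        // ...rescaled with this block's own factors, so an outlier in one
        // block does not force a coarse scale onto the whole tensor.
        acc += float(iacc) * sA[m * kblocks + kb] * sB[kb * N + n];
    }
    C[m * N + n] = acc;
}
```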

FlashInfer: Kernel Library for LLM Serving

Cuda 4,347 614 Updated Dec 24, 2025

Tile primitives for speedy kernels

Cuda 3,013 220 Updated Dec 9, 2025

[ICLR2025, ICML2025, NeurIPS2025 Spotlight] Quantized Attention achieves speedup of 2-5x compared to FlashAttention, without losing end-to-end metrics across language, image, and video models.

Cuda 2,914 291 Updated Dec 22, 2025

How to optimize some algorithms in CUDA.

Cuda 2,711 244 Updated Dec 23, 2025

Learn CUDA Programming, published by Packt

Cuda 1,218 262 Updated Dec 30, 2023

Flash Attention in ~100 lines of CUDA (forward pass only)

Cuda 1,031 102 Updated Dec 30, 2024
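The forward pass that repo implements rests on a streaming-softmax recurrence; here is a hedged serial sketch of it, with assumed names and dimensions (one thread per query row, head dimension capped at 64):

```cuda
// Serial sketch of the streaming-softmax recurrence behind a FlashAttention
// forward pass: one thread owns one query row and walks the keys once,
// keeping a running max m, normalizer l, and output accumulator acc.
// Assumes head dimension d <= 64; real kernels tile K/V via shared memory.
#include <cuda_runtime.h>
#include <math.h>

__global__ void attention_forward_rowwise(const float* Q, const float* K,
                                          const float* V, float* O,
                                          int N, int d) {
    int q = blockIdx.x * blockDim.x + threadIdx.x;  // query row index
    if (q >= N || d > 64) return;
    float m = -INFINITY, l = 0.f;
    float acc[64];
    for (int j = 0; j < d; ++j) acc[j] = 0.f;
    for (int k = 0; k < N; ++k) {
        float s = 0.f;                              // s = (q . k_k) / sqrt(d)
        for (int j = 0; j < d; ++j) s += Q[q * d + j] * K[k * d + j];
        s *= rsqrtf((float)d);
        float m_new = fmaxf(m, s);
        float corr = expf(m - m_new);               // rescale old partial sums
        float p = expf(s - m_new);
        l = l * corr + p;
        for (int j = 0; j < d; ++j)
            acc[j] = acc[j] * corr + p * V[k * d + j];
        m = m_new;
    }
    for (int j = 0; j < d; ++j) O[q * d + j] = acc[j] / l;  // final normalize
}
```

The correction factor expf(m - m_new) rescales the accumulated output whenever a new running maximum appears, which is what lets the kernel avoid materializing the full N x N score matrix.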

Training materials associated with NVIDIA's CUDA Training Series (www.olcf.ornl.gov/cuda-training-series/)

Cuda 925 339 Updated Aug 19, 2024

Fastest kernels written from scratch

Cuda 500 61 Updated Sep 18, 2025

A simple high-performance CUDA GEMM implementation.

Cuda 421 42 Updated Jan 4, 2024

Step-by-step optimization of CUDA SGEMM

Cuda 416 54 Updated Mar 30, 2022
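Step-by-step SGEMM guides like the two GEMM repos above typically move from a naive one-thread-per-element kernel to shared-memory tiling; a minimal sketch of that tiled step, assuming dimensions divisible by the tile size:

```cuda
// Shared-memory tiled SGEMM, the classic first optimization step after the
// naive one-thread-per-element kernel. M, N, K are assumed divisible by TILE.
// Launch: dim3 block(TILE, TILE), grid(N / TILE, M / TILE).
#include <cuda_runtime.h>

#define TILE 32

__global__ void sgemm_tiled(const float* A, const float* B, float* C,
                            int M, int N, int K) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.f;
    for (int t = 0; t < K; t += TILE) {
        // Stage one TILE x TILE block of A and B in shared memory...
        As[threadIdx.y][threadIdx.x] = A[row * K + t + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t + threadIdx.y) * N + col];
        __syncthreads();
        // ...then each thread reuses it TILE times instead of re-reading DRAM.
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[row * N + col] = acc;  // C = A (MxK) * B (KxN), row-major
}
```

Each staged element is loaded from global memory once per tile but read TILE times from shared memory, which is the arithmetic-intensity argument such tutorials develop step by step.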

[MLSys'24] Atom: Low-bit Quantization for Efficient and Accurate LLM Serving

Cuda 331 30 Updated Jul 2, 2024

Efficient GPU support for LLM inference with x-bit quantization (e.g., FP6, FP5).

Cuda 277 22 Updated Jul 16, 2025

A set of hands-on tutorials for CUDA programming

Cuda 243 35 Updated Apr 8, 2024

(no description)

Cuda 214 172 Updated Aug 27, 2019

Distributed MoE in a Single Kernel [NeurIPS '25]

Cuda 157 18 Updated Dec 23, 2025

Benchmark code for the "Online normalizer calculation for softmax" paper

Cuda 103 10 Updated Jul 27, 2018
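For reference, the recurrence that paper (and benchmark) is about, as a minimal serial sketch with illustrative names:

```cuda
// The paper's recurrence: the running maximum and the running normalizer are
// updated together, so softmax needs two passes over x instead of three.
// Serial host-side sketch; the repo benchmarks parallel CUDA variants.
#include <cstdio>
#include <math.h>

void online_softmax(const float* x, float* y, int n) {
    float m = -INFINITY, d = 0.f;
    for (int i = 0; i < n; ++i) {   // pass 1: fused max + normalizer
        float m_new = fmaxf(m, x[i]);
        d = d * expf(m - m_new) + expf(x[i] - m_new);
        m = m_new;
    }
    for (int i = 0; i < n; ++i)     // pass 2: normalize
        y[i] = expf(x[i] - m) / d;
}

int main() {
    float x[4] = {1.f, 2.f, 3.f, 4.f}, y[4];
    online_softmax(x, y, 4);
    for (int i = 0; i < 4; ++i) printf("%.4f\n", y[i]);
    return 0;
}
```

This is the same max-rescaling trick that the FlashAttention sketch above applies to its output accumulator; here it only fuses the max pass with the normalizer pass.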