Stars
📚LeetCUDA: modern CUDA learning notes with PyTorch for beginners🐑; 200+ CUDA kernels, Tensor Cores, HGEMM, FA-2 MMA.🎉
DeepEP: an efficient expert-parallel communication library
DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling (a minimal sketch of per-block scaling follows this list)
[ICLR2025, ICML2025, NeurIPS2025 Spotlight] Quantized attention that achieves a 2-5x speedup over FlashAttention without losing end-to-end metrics across language, image, and video models.
This package contains the original 2012 AlexNet code.
[ICML2025] SpargeAttention: a training-free sparse attention that accelerates inference for any model.
Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity
CPM.cu is a lightweight, high-performance CUDA implementation for LLMs, optimized for end-device inference and featuring cutting-edge techniques in sparse architecture, speculative sampling and qua…
A lightweight design for computation-communication overlap.
NVSHMEM-Tutorial: Build a DeepEP-like GPU Buffer
Benchmark code for the "Online normalizer calculation for softmax" paper (see the one-pass sketch after this list)
Batch computation of the linear assignment problem on GPU.
Source code for the CPU-Free model, a fully autonomous execution model for multi-GPU applications that removes all CPU involvement beyond the initial kernel launch.
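Two of the entries above name techniques concrete enough to sketch. First, DeepGEMM's "fine-grained scaling": rather than one scale factor per tensor, each small block of values gets its own scale so that every block fits the FP8 E4M3 range. Below is a hedged NumPy sketch of the per-block quantize/dequantize round trip; the block size, function names, and the use of 448 as the E4M3 max are illustrative assumptions, not DeepGEMM's actual API or tiling.

```python
import numpy as np

E4M3_MAX = 448.0  # largest finite FP8 E4M3 value
BLOCK = 128       # per-block granularity (assumed for illustration)

def quantize_blockwise(x: np.ndarray):
    """Emulate fine-grained FP8 scaling: one scale per BLOCK-sized chunk,
    chosen so the chunk's max magnitude maps onto the E4M3 range."""
    pad = (-len(x)) % BLOCK
    blocks = np.pad(x, (0, pad)).reshape(-1, BLOCK)
    amax = np.abs(blocks).max(axis=1, keepdims=True)
    scales = np.where(amax > 0, amax / E4M3_MAX, 1.0)
    q = np.clip(blocks / scales, -E4M3_MAX, E4M3_MAX)  # would be cast to FP8 here
    return q, scales, len(x)

def dequantize_blockwise(q, scales, n):
    return (q * scales).reshape(-1)[:n]

x = np.random.randn(300).astype(np.float32) * 10.0
q, scales, n = quantize_blockwise(x)
x_hat = dequantize_blockwise(q, scales, n)
# Near-zero error: only the scaling is emulated here, not actual FP8 rounding.
print(np.abs(x - x_hat).max())
```

Per-block scales bound the quantization error by each block's local dynamic range rather than the whole tensor's, so an outlier in one block does not wash out precision everywhere else.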
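Second, the online softmax benchmark: the "Online normalizer calculation for softmax" paper fuses softmax's two reduction passes (max, then sum of exponentials) into one by rescaling the running normalizer whenever a new maximum appears; the same recurrence is what lets FlashAttention process attention rows tile by tile. A minimal single-pass sketch in plain Python (the function name is mine, not from the benchmark repo):

```python
import math

def online_softmax(xs):
    """Single pass: maintain the running max m and the running
    normalizer d = sum(exp(x - m)) over the elements seen so far."""
    m = float("-inf")
    d = 0.0
    for x in xs:
        m_new = max(m, x)
        # Rescale the old normalizer to the new max, then add this term.
        d = d * math.exp(m - m_new) + math.exp(x - m_new)
        m = m_new
    return [math.exp(x - m) / d for x in xs]

# Agrees with the naive two-pass softmax:
print(online_softmax([1.0, 2.0, 3.0]))  # ~[0.0900, 0.2447, 0.6652]
```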