🎯
Focusing
  • Peking University
  • Shenzhen

15 starred source repositories written in Cuda

LLM training in simple, raw C/CUDA

Cuda 28,460 3,338 Updated Jun 26, 2025

📚 LeetCUDA: modern CUDA learning notes with PyTorch for beginners; 200+ CUDA kernels covering Tensor Cores, HGEMM, and FlashAttention-2 MMA. 🎉

Cuda 9,045 889 Updated Dec 24, 2025

DeepEP: an efficient expert-parallel communication library

Cuda 8,828 1,036 Updated Dec 24, 2025

DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling

Cuda 5,996 779 Updated Dec 23, 2025

Tile primitives for speedy kernels

Cuda 3,016 220 Updated Dec 9, 2025

[ICLR2025, ICML2025, NeurIPS2025 Spotlight] Quantized attention that achieves 2-5x speedups over FlashAttention without losing end-to-end metrics across language, image, and video models.

Cuda 2,920 291 Updated Dec 22, 2025

This package contains the original 2012 AlexNet code.

Cuda 2,795 360 Updated Mar 12, 2025

[ICML2025] SpargeAttention: A training-free sparse attention that accelerates any model inference.

Cuda 858 73 Updated Dec 17, 2025

Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity

Cuda 230 22 Updated Sep 24, 2023

CPM.cu is a lightweight, high-performance CUDA implementation for LLMs, optimized for end-device inference and featuring cutting-edge techniques in sparse architecture, speculative sampling and qua…

Cuda 212 21 Updated Oct 10, 2025

A lightweight design for computation-communication overlap.

Cuda 200 9 Updated Oct 10, 2025

NVSHMEM-Tutorial: Build a DeepEP-like GPU Buffer

Cuda 150 14 Updated Sep 18, 2025

Benchmark code for the "Online normalizer calculation for softmax" paper

Cuda 103 10 Updated Jul 27, 2018

Batch computation of the linear assignment problem on GPU.

Cuda 97 22 Updated Sep 16, 2025

Source code for the CPU-Free model - a fully autonomous execution model for multi-GPU applications that completely excludes the involvement of the CPU beyond the initial kernel launch.

Cuda 22 3 Updated Apr 25, 2024