Starred repositories
📚 LeetCUDA: modern CUDA learning notes with PyTorch for beginners 🐑; 200+ CUDA kernels, Tensor Cores, HGEMM, FA-2 MMA. 🎉
DeepEP: an efficient expert-parallel communication library
DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling
FlashInfer: Kernel Library for LLM Serving
[ICLR 2025, ICML 2025, NeurIPS 2025 Spotlight] Quantized attention that achieves a 2-5x speedup over FlashAttention without losing end-to-end metrics across language, image, and video models.
How to optimize various algorithms in CUDA.
[ARCHIVED] Cooperative primitives for CUDA C++. See https://github.com/NVIDIA/cccl
Flash Attention in ~100 lines of CUDA (forward pass only)
RAFT contains fundamental widely-used algorithms and primitives for machine learning and information retrieval. The algorithms are CUDA-accelerated and form building blocks for more easily writing …
cuVS - a library for vector search and clustering on the GPU
A static, suckless, single-batch, CUDA-only Qwen3-0.6B mini inference engine.
[ICML 2024] Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference
[MLSys'24] Atom: Low-bit Quantization for Efficient and Accurate LLM Serving
Approximate nearest neighbor search with product quantization on GPU, in PyTorch and CUDA.
CPM.cu is a lightweight, high-performance CUDA implementation for LLMs, optimized for end-device inference and featuring cutting-edge techniques in sparse architecture, speculative sampling and qua…
CUDA implementation of Hierarchical Navigable Small World Graph algorithm
GGNN: State of the Art Graph-based GPU Nearest Neighbor Search
Benchmark code for the "Online normalizer calculation for softmax" paper
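The online-normalizer trick from the "Online normalizer calculation for softmax" paper referenced above computes the max and the normalizing sum in a single pass, rescaling the running sum whenever a new maximum appears. A minimal Python sketch (names are illustrative, not from the benchmark repo):

```python
import math

def online_softmax(xs):
    # Single pass over the input: maintain the running max m and the
    # running normalizer d, rescaling d by exp(m - m_new) whenever the
    # max increases, so no separate max-reduction pass is needed.
    m, d = float("-inf"), 0.0
    for x in xs:
        m_new = max(m, x)
        d = d * math.exp(m - m_new) + math.exp(x - m_new)
        m = m_new
    # Final normalization uses the max for numerical stability.
    return [math.exp(x - m) / d for x in xs]
```

Fusing the two reductions this way is what lets FlashAttention-style kernels stream over keys without materializing the full score row.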
FP64 equivalent GEMM via Int8 Tensor Cores using the Ozaki scheme
ademeure / DeeperGEMM
Forked from deepseek-ai/DeepGEMM. DeeperGEMM: a crazily optimized version.
A High-Throughput Multi-GPU System for Graph-Based Approximate Nearest Neighbor Search
High-Throughput, Cost-Effective Billion-Scale Vector Search with a Single GPU [to appear in SIGMOD'26]
A cross-modal vector index with fast construction on heterogeneous CPU-GPU environment. Published on DaMoN@SIGMOD 2025.
Adamas: Hadamard Sparse Attention for Efficient Long-context Inference