  • Hilbert Space
  • 04:13 (UTC +08:00)
  • LinkedIn in/zhwangcs

Organizations

@milvus-io

Starred repositories

30 starred repositories written in Cuda

📚LeetCUDA: Modern CUDA Learn Notes with PyTorch for Beginners🐑, 200+ CUDA Kernels, Tensor Cores, HGEMM, FA-2 MMA.🎉

Cuda · 8,983 stars · 877 forks · Updated Dec 4, 2025

DeepEP: an efficient expert-parallel communication library

Cuda · 8,816 stars · 1,033 forks · Updated Dec 5, 2025

DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling

Cuda · 5,978 stars · 778 forks · Updated Dec 8, 2025

FlashInfer: Kernel Library for LLM Serving

Cuda · 4,306 stars · 606 forks · Updated Dec 19, 2025

[ICLR2025, ICML2025, NeurIPS2025 Spotlight] Quantized Attention achieves speedup of 2-5x compared to FlashAttention, without losing end-to-end metrics across language, image, and video models.

Cuda · 2,869 stars · 289 forks · Updated Dec 11, 2025

How to optimize some algorithms in CUDA.

Cuda · 2,696 stars · 244 forks · Updated Dec 6, 2025

cuGraph - RAPIDS Graph Analytics Library

Cuda · 2,091 stars · 342 forks · Updated Dec 19, 2025

[ARCHIVED] Cooperative primitives for CUDA C++. See https://github.com/NVIDIA/cccl

Cuda · 1,809 stars · 464 forks · Updated Oct 9, 2023
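
As a quick illustration of what "cooperative primitives" means here, a minimal block-wide reduction using CUB's BlockReduce might look like the sketch below; the kernel name and the 256-thread block size are my own choices, not taken from the library's examples.

```cuda
#include <cub/cub.cuh>

// Block-wide sum: each block of 256 threads cooperatively reduces 256
// consecutive floats to a single value using cub::BlockReduce.
__global__ void block_sum_kernel(const float* d_in, float* d_out) {
    using BlockReduce = cub::BlockReduce<float, 256>;
    __shared__ typename BlockReduce::TempStorage temp_storage;

    float thread_val = d_in[blockIdx.x * 256 + threadIdx.x];
    float block_total = BlockReduce(temp_storage).Sum(thread_val);

    if (threadIdx.x == 0) d_out[blockIdx.x] = block_total;
}
```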

Flash Attention in ~100 lines of CUDA (forward pass only)

Cuda · 1,023 stars · 100 forks · Updated Dec 30, 2024
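
The trick that makes a single-sweep forward pass possible is the online-softmax rescaling of both the running normalizer and the output accumulator. A drastically simplified sketch follows (one thread per query row, no tiling or shared memory), which only illustrates the recurrence, not the repository's actual kernel; all names and shapes are assumptions.

```cuda
#include <cfloat>
#include <cmath>
#include <cuda_runtime.h>

// Simplified attention forward pass: one thread handles one query row and
// streams over all keys/values, keeping a running max (m), running
// normalizer (l), and an unnormalized output accumulator that is rescaled
// whenever m grows -- the same recurrence FlashAttention tiles over SRAM.
// Shapes: Q, K, V, O are (N x d) row-major; d <= 128 assumed.
__global__ void attention_forward_naive(const float* Q, const float* K,
                                        const float* V, float* O,
                                        int N, int d, float scale) {
    int q = blockIdx.x * blockDim.x + threadIdx.x;
    if (q >= N) return;

    float acc[128];                       // unnormalized output row
    for (int j = 0; j < d; ++j) acc[j] = 0.0f;
    float m = -FLT_MAX, l = 0.0f;

    for (int k = 0; k < N; ++k) {
        float s = 0.0f;                   // s = scale * <Q[q], K[k]>
        for (int j = 0; j < d; ++j) s += Q[q * d + j] * K[k * d + j];
        s *= scale;

        float m_new = fmaxf(m, s);
        float correction = expf(m - m_new);   // rescale old contributions
        float p = expf(s - m_new);
        l = l * correction + p;
        for (int j = 0; j < d; ++j)
            acc[j] = acc[j] * correction + p * V[k * d + j];
        m = m_new;
    }
    for (int j = 0; j < d; ++j) O[q * d + j] = acc[j] / l;
}
```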

Fast CUDA matrix multiplication from scratch

Cuda · 980 stars · 148 forks · Updated Sep 2, 2025
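
For reference, the baseline such tuning work typically measures against is a naive kernel with one thread per output element, roughly like the sketch below; the kernel name and launch shape are illustrative, not taken from the repository.

```cuda
#include <cuda_runtime.h>

// Naive SGEMM: C = alpha * A * B + beta * C, with A (M x K), B (K x N),
// C (M x N), all row-major. One thread computes one element of C.
__global__ void sgemm_naive(int M, int N, int K, float alpha,
                            const float* A, const float* B,
                            float beta, float* C) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < M && col < N) {
        float acc = 0.0f;
        for (int k = 0; k < K; ++k)
            acc += A[row * K + k] * B[k * N + col];
        C[row * N + col] = alpha * acc + beta * C[row * N + col];
    }
}

// Illustrative launch: 32x32 thread blocks covering the M x N output.
// dim3 block(32, 32);
// dim3 grid((N + 31) / 32, (M + 31) / 32);
// sgemm_naive<<<grid, block>>>(M, N, K, alpha, dA, dB, beta, dC);
```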

RAFT contains fundamental widely-used algorithms and primitives for machine learning and information retrieval. The algorithms are CUDA-accelerated and form building blocks for more easily writing …

Cuda · 963 stars · 220 forks · Updated Dec 19, 2025

cuVS - a library for vector search and clustering on the GPU

Cuda · 597 stars · 147 forks · Updated Dec 19, 2025

Static suckless single batch CUDA-only qwen3-0.6B mini inference engine

Cuda · 535 stars · 45 forks · Updated Sep 8, 2025

Fastest kernels written from scratch

Cuda · 499 stars · 62 forks · Updated Sep 18, 2025

[ICML 2024] Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference

Cuda · 360 stars · 38 forks · Updated Jul 10, 2025

[MLSys'24] Atom: Low-bit Quantization for Efficient and Accurate LLM Serving

Cuda · 331 stars · 30 forks · Updated Jul 2, 2024

Approximate nearest neighbor search with product quantization on GPU in pytorch and cuda

Cuda · 231 stars · 23 forks · Updated Dec 12, 2023
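
The GPU-friendly core of PQ-based search is asymmetric distance computation: per query, a small lookup table is built once, and each database vector's distance is just a sum of table entries indexed by its codes. A hedged sketch of such a kernel follows; the layout and names are assumptions, not the repository's API.

```cuda
#include <cuda_runtime.h>
#include <stdint.h>

// Asymmetric distance computation (ADC) for product quantization.
// Each database vector is stored as M one-byte codes. lut[m * 256 + c]
// holds the squared distance between the query's m-th sub-vector and
// centroid c of sub-quantizer m; the full distance is the sum over m.
__global__ void pq_adc_distances(const uint8_t* codes,  // (N x M) codes
                                 const float* lut,      // (M x 256) query table
                                 float* dist,           // (N) output distances
                                 int N, int M) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= N) return;
    float acc = 0.0f;
    for (int m = 0; m < M; ++m)
        acc += lut[m * 256 + codes[i * M + m]];
    dist[i] = acc;
}
```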

CPM.cu is a lightweight, high-performance CUDA implementation for LLMs, optimized for end-device inference and featuring cutting-edge techniques in sparse architecture, speculative sampling and qua…

Cuda · 212 stars · 21 forks · Updated Oct 10, 2025

CUDA implementation of Hierarchical Navigable Small World Graph algorithm

Cuda · 170 stars · 29 forks · Updated Apr 19, 2021

GGNN: State of the Art Graph-based GPU Nearest Neighbor Search

Cuda · 167 stars · 27 forks · Updated Feb 11, 2025

Benchmark code for the "Online normalizer calculation for softmax" paper

Cuda · 103 stars · 10 forks · Updated Jul 27, 2018
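
The recurrence that paper benchmarks computes the running maximum and the normalizer in a single pass, rescaling the partial sum whenever the maximum grows; written out (my transcription of the idea, not code from the repository):

```latex
m_j = \max(m_{j-1}, x_j), \qquad m_0 = -\infty, \qquad
d_j = d_{j-1}\, e^{\,m_{j-1} - m_j} + e^{\,x_j - m_j}, \qquad d_0 = 0,
```

so after one sweep over a length-V vector, softmax(x)_i = exp(x_i - m_V) / d_V, with no separate max pass.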

FP64 equivalent GEMM via Int8 Tensor Cores using the Ozaki scheme

Cuda · 99 stars · 5 forks · Updated Dec 2, 2025
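
Roughly, the Ozaki scheme is an error-free splitting of the operands: each FP64 matrix is written as a short sum of slices with few enough significand bits that every pairwise slice product is exact in the low-precision unit (here Int8 inputs with Int32 accumulation), and the exact partial products are summed back in FP64. Sketched below; the slice counts s_A and s_B are accuracy/performance knobs, and the repository's exact formulation may differ.

```latex
A = \sum_{p=1}^{s_A} A^{(p)}, \qquad
B = \sum_{q=1}^{s_B} B^{(q)}, \qquad
AB = \sum_{p=1}^{s_A} \sum_{q=1}^{s_B} A^{(p)} B^{(q)},
```

with each partial product A^{(p)} B^{(q)} computed exactly on the integer tensor cores.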

DeeperGEMM: crazy optimized version

Cuda · 73 stars · Updated May 5, 2025

🌈 Solutions of LeetGPU

Cuda · 58 stars · 9 forks · Updated Nov 12, 2025

A High-Throughput Multi-GPU System for Graph-Based Approximate Nearest Neighbor Search

Cuda · 20 stars · 1 fork · Updated Jul 22, 2025

High-Throughput, Cost-Effective Billion-Scale Vector Search with a Single GPU [to appear in SIGMOD'26]

Cuda · 18 stars · 3 forks · Updated Sep 26, 2025

A cross-modal vector index with fast construction on heterogeneous CPU-GPU environment. Published on DaMoN@SIGMOD 2025.

Cuda · 15 stars · Updated Jul 16, 2025

Adamas: Hadamard Sparse Attention for Efficient Long-context Inference

Cuda · 10 stars · 1 fork · Updated Nov 25, 2025
Cuda · 8 stars · 2 forks · Updated Sep 18, 2025