hhy3

Organizations: @milvus-io
Starred repositories

37 starred repositories written in Cuda

📚LeetCUDA: Modern CUDA Learn Notes with PyTorch for Beginners🐑, 200+ CUDA Kernels, Tensor Cores, HGEMM, FA-2 MMA.🎉

Cuda · 10,808 stars · 1,091 forks · Updated Apr 20, 2026

DeepEP: an efficient expert-parallel communication library

Cuda · 9,578 stars · 1,208 forks · Updated Apr 28, 2026

DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling

Cuda · 7,126 stars · 946 forks · Updated Apr 24, 2026

[ICLR2025, ICML2025, NeurIPS2025 Spotlight] Quantized Attention achieves speedup of 2-5x compared to FlashAttention, without losing end-to-end metrics across language, image, and video models.

Cuda · 3,331 stars · 401 forks · Updated Jan 17, 2026

How to optimize common algorithms in CUDA.

Cuda · 2,952 stars · 272 forks · Updated Apr 22, 2026

Mirage Persistent Kernel: Compiling LLMs into a MegaKernel

Cuda · 2,231 stars · 200 forks · Updated Apr 27, 2026

cuGraph - RAPIDS Graph Analytics Library

Cuda · 2,166 stars · 350 forks · Updated Apr 28, 2026

[ARCHIVED] Cooperative primitives for CUDA C++. See https://github.com/NVIDIA/cccl

Cuda · 1,827 stars · 462 forks · Updated Oct 9, 2023

Fast CUDA matrix multiplication from scratch

Cuda · 1,158 stars · 178 forks · Updated Sep 2, 2025
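"From scratch" GEMM worklogs like this one usually start from loop blocking: computing C tile by tile so each tile of A and B is reused many times, which is the same idea CUDA kernels implement with shared-memory tiles. A minimal NumPy sketch of the blocking structure (not code from the repo; tile size and names are my own):

```python
import numpy as np

def blocked_matmul(A, B, tile=32):
    """Loop-blocked matmul: accumulate C one tile at a time, mirroring
    the shared-memory tiling used in CUDA GEMM kernels."""
    n, k = A.shape
    _, m = B.shape
    C = np.zeros((n, m))
    for i in range(0, n, tile):
        for j in range(0, m, tile):
            for p in range(0, k, tile):
                # one tile-sized multiply-accumulate; NumPy slicing
                # clips automatically at the matrix edges
                C[i:i+tile, j:j+tile] += A[i:i+tile, p:p+tile] @ B[p:p+tile, j:j+tile]
    return C
```

On a GPU the inner tile product is what a thread block computes out of shared memory; here the `@` on each tile stands in for it.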

Flash Attention in ~100 lines of CUDA (forward pass only)

Cuda · 1,125 stars · 111 forks · Updated Dec 30, 2024
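The forward pass this repo implements rests on streaming over key/value blocks while keeping a running row max and normalizer, so the full N×N score matrix is never materialized. A hedged NumPy sketch of that core recurrence (function name and block size are my own, not the repo's API):

```python
import numpy as np

def flash_attention_forward(Q, K, V, block=64):
    """Forward-only attention over key/value blocks with running
    (max, normalizer) accumulators instead of a full score matrix."""
    n, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros_like(Q, dtype=np.float64)
    m = np.full(n, -np.inf)   # running row max of the scores
    l = np.zeros(n)           # running row normalizer (sum of exps)
    for j in range(0, K.shape[0], block):
        S = (Q @ K[j:j+block].T) * scale          # scores for this KV block
        m_new = np.maximum(m, S.max(axis=1))
        p = np.exp(S - m_new[:, None])
        alpha = np.exp(m - m_new)                 # rescale old accumulators
        l = l * alpha + p.sum(axis=1)
        out = out * alpha[:, None] + p @ V[j:j+block]
        m = m_new
    return out / l[:, None]
```

The rescaling factor `alpha` is what lets the partial output and normalizer from earlier blocks stay valid when a later block raises the row max.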

RAFT contains fundamental widely-used algorithms and primitives for machine learning and information retrieval. The algorithms are CUDA-accelerated and form building blocks for more easily writing …

Cuda · 1,002 stars · 231 forks · Updated Apr 28, 2026

cuVS - a library for vector search and clustering on the GPU

Cuda · 737 stars · 183 forks · Updated Apr 28, 2026

Fastest kernels written from scratch

Cuda · 575 stars · 74 forks · Updated Sep 18, 2025

Static, suckless, single-batch, CUDA-only qwen3-0.6B mini inference engine

Cuda · 553 stars · 48 forks · Updated Sep 8, 2025

FlashKDA: high-performance Kimi Delta Attention kernels

Cuda · 399 stars · 30 forks · Updated Apr 22, 2026

[ICML 2024] Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference

Cuda · 382 stars · 47 forks · Updated Jul 10, 2025

[MLSys'24] Atom: Low-bit Quantization for Efficient and Accurate LLM Serving

Cuda · 338 stars · 32 forks · Updated Jul 2, 2024

mHC kernels implemented in CUDA

Cuda · 263 stars · 20 forks · Updated Mar 9, 2026

CPM.cu is a lightweight, high-performance CUDA implementation for LLMs, optimized for end-device inference and featuring cutting-edge techniques in sparse architecture, speculative sampling and qua…

Cuda · 239 stars · 22 forks · Updated Jan 14, 2026

Approximate nearest neighbor search with product quantization on GPU in pytorch and cuda

Cuda · 232 stars · 23 forks · Updated Dec 12, 2023
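Product quantization splits each vector into M subvectors, quantizes every subvector against its own small codebook, and then answers distance queries with per-subspace lookup tables. A toy NumPy sketch of encoding plus asymmetric distance computation (random codebooks stand in for the k-means-trained ones; all names are mine, not the repo's API):

```python
import numpy as np

rng = np.random.default_rng(0)
M, K, d = 4, 16, 32           # subspaces, centroids per subspace, vector dim
sub = d // M
# Toy codebooks; a real index trains these with k-means per subspace.
codebooks = rng.normal(size=(M, K, sub))

def pq_encode(x):
    """Quantize each subvector to the index of its nearest centroid."""
    codes = np.empty(M, dtype=np.uint8)
    for m in range(M):
        diff = codebooks[m] - x[m*sub:(m+1)*sub]
        codes[m] = np.argmin((diff**2).sum(axis=1))
    return codes

def adc_distance(q, codes):
    """Asymmetric distance: exact query vs quantized database vector,
    summed from per-subspace distance lookup tables."""
    dist = 0.0
    for m in range(M):
        diff = codebooks[m] - q[m*sub:(m+1)*sub]
        table = (diff**2).sum(axis=1)   # distances to all K centroids
        dist += table[codes[m]]
    return dist
```

The lookup tables are what make PQ search GPU-friendly: scanning a database of codes reduces to M table fetches and adds per vector.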

From Minimal GEMM to Everything

Cuda · 201 stars · 10 forks · Updated Feb 10, 2026

CUDA implementation of Hierarchical Navigable Small World Graph algorithm

Cuda · 175 stars · 27 forks · Updated Apr 19, 2021
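HNSW's core primitive is greedy routing on a proximity graph: hop to whichever neighbor is closest to the query until no neighbor improves, repeating from the top layer down. A minimal Python sketch of that single-layer step (illustrative only; not the repo's API):

```python
import numpy as np

def greedy_search(vectors, neighbors, entry, query):
    """Greedy routing on one graph layer: move to the closest neighbor
    until the current node beats all of its neighbors."""
    cur = entry
    cur_dist = np.linalg.norm(vectors[cur] - query)
    improved = True
    while improved:
        improved = False
        for nb in neighbors[cur]:
            d = np.linalg.norm(vectors[nb] - query)
            if d < cur_dist:
                cur, cur_dist, improved = nb, d, True
    return cur
```

The full algorithm runs this routing on each layer of the hierarchy, using the result as the entry point for the layer below, and widens the search to a beam on the bottom layer.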

GGNN: State of the Art Graph-based GPU Nearest Neighbor Search

Cuda · 173 stars · 28 forks · Updated Feb 11, 2025

FP64 equivalent GEMM by the Ozaki scheme with Int8 Tensor Cores

Cuda · 120 stars · 7 forks · Updated Dec 2, 2025

Benchmark code for the "Online normalizer calculation for softmax" paper

Cuda · 110 stars · 10 forks · Updated Jul 27, 2018
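The paper this repo benchmarks ("Online normalizer calculation for softmax", Milakov & Gimelshein, 2018) fuses the softmax max pass and sum pass into one: keep a running max and a running sum of exponentials, and rescale the sum whenever the max grows. A small Python sketch of the recurrence (names are mine):

```python
import math

def online_softmax(xs):
    """One-pass softmax normalizer: track the running max m and the
    running sum d of exp(x - m), rescaling d whenever m increases."""
    m = float("-inf")  # running maximum seen so far
    d = 0.0            # running normalizer, relative to current m
    for x in xs:
        m_new = max(m, x)
        # rescale the old sum to the new max, then add the new term
        d = d * math.exp(m - m_new) + math.exp(x - m_new)
        m = m_new
    return [math.exp(x - m) / d for x in xs]
```

The same rescaling trick is what FlashAttention later applied blockwise to attention scores.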

DeeperGEMM: crazy optimized version

Cuda · 86 stars · Updated May 5, 2025

🌈 Solutions of LeetGPU

Cuda · 84 stars · 11 forks · Updated Apr 10, 2026

Efficient and unified implementations for TopK-based sparse attention

Cuda · 35 stars · Updated Apr 20, 2026

High-Throughput, Cost-Effective Billion-Scale Vector Search with a Single GPU [SIGMOD'26]

Cuda · 23 stars · 5 forks · Updated Apr 22, 2026