hhy3's GitHub profile

Organizations: @milvus-io
Starred repositories

34 starred repositories written in CUDA

📚LeetCUDA: Modern CUDA Learn Notes with PyTorch for Beginners🐑, 200+ CUDA Kernels, Tensor Cores, HGEMM, FA-2 MMA.🎉

CUDA · 10,236 stars · 1,038 forks · Updated Apr 8, 2026

DeepEP: an efficient expert-parallel communication library

CUDA · 9,108 stars · 1,149 forks · Updated Apr 9, 2026

DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling

CUDA · 6,320 stars · 858 forks · Updated Mar 22, 2026

[ICLR2025, ICML2025, NeurIPS2025 Spotlight] Quantized Attention achieves speedup of 2-5x compared to FlashAttention, without losing end-to-end metrics across language, image, and video models.

CUDA · 3,290 stars · 393 forks · Updated Jan 17, 2026

How to optimize some algorithms in CUDA.

CUDA · 2,914 stars · 267 forks · Updated Apr 9, 2026

Mirage Persistent Kernel: Compiling LLMs into a MegaKernel

CUDA · 2,186 stars · 193 forks · Updated Apr 10, 2026

cuGraph - RAPIDS Graph Analytics Library

CUDA · 2,160 stars · 348 forks · Updated Apr 10, 2026

[ARCHIVED] Cooperative primitives for CUDA C++. See https://github.com/NVIDIA/cccl

CUDA · 1,826 stars · 463 forks · Updated Oct 9, 2023

Fast CUDA matrix multiplication from scratch

CUDA · 1,127 stars · 172 forks · Updated Sep 2, 2025
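The central optimization in "GEMM from scratch" tutorials like this one is tiling: splitting the output into blocks so each tile of A and B is loaded once into shared memory and reused many times. A minimal pure-Python sketch of the same loop restructuring (the function names are illustrative, not from the repo), showing that the tiled loop order computes the same result as the naive triple loop:

```python
def matmul_naive(A, B):
    """Reference triple-loop matrix multiply."""
    n, k, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            for p in range(k):
                C[i][j] += A[i][p] * B[p][j]
    return C

def matmul_tiled(A, B, T=2):
    """Blocked multiply: iterate over T x T tiles. On a GPU, each
    (i0, j0, p0) block is the sub-matrix a thread block would stage
    in shared memory before accumulating."""
    n, k, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, T):
        for j0 in range(0, m, T):
            for p0 in range(0, k, T):
                for i in range(i0, min(i0 + T, n)):
                    for j in range(j0, min(j0 + T, m)):
                        for p in range(p0, min(p0 + T, k)):
                            C[i][j] += A[i][p] * B[p][j]
    return C
```

The tiled version does the same arithmetic in a different order; the payoff on a GPU comes from the memory reuse the blocking enables, not from the loop structure itself.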

Flash Attention in ~100 lines of CUDA (forward pass only)

CUDA · 1,115 stars · 111 forks · Updated Dec 30, 2024
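The trick that makes a forward-only Flash Attention fit in ~100 lines is streaming: softmax and the weighted sum of values are computed in one pass over the keys, keeping a running max, running normalizer, and running output that are rescaled whenever the max grows. A pure-Python sketch of that recurrence for a single query (not the repo's CUDA code, which tiles this over shared-memory blocks):

```python
import math

def attention_streaming(q, K, V):
    """Single-query attention in one pass over (key, value) rows.
    m: running max score, l: running softmax normalizer,
    acc: running unnormalized output; all rescaled when m increases."""
    m = float("-inf")
    l = 0.0
    acc = [0.0] * len(V[0])
    for k_row, v_row in zip(K, V):
        s = sum(qi * ki for qi, ki in zip(q, k_row))  # score = dot(q, k)
        m_new = max(m, s)
        scale = math.exp(m - m_new)   # exp(-inf) == 0.0 on the first step
        p = math.exp(s - m_new)
        l = l * scale + p
        acc = [a * scale + p * v for a, v in zip(acc, v_row)]
        m = m_new
    return [a / l for a in acc]
```

Because each step only touches one key/value row plus O(head_dim) state, the full score matrix never needs to be materialized, which is the memory saving Flash Attention exploits.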

RAFT contains fundamental widely-used algorithms and primitives for machine learning and information retrieval. The algorithms are CUDA-accelerated and form building blocks for more easily writing …

CUDA · 994 stars · 228 forks · Updated Apr 10, 2026

cuVS - a library for vector search and clustering on the GPU

CUDA · 730 stars · 179 forks · Updated Apr 10, 2026

Fastest kernels written from scratch

CUDA · 565 stars · 71 forks · Updated Sep 18, 2025

Static suckless single batch CUDA-only qwen3-0.6B mini inference engine

CUDA · 553 stars · 48 forks · Updated Sep 8, 2025

[ICML 2024] Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference

CUDA · 380 stars · 45 forks · Updated Jul 10, 2025

[MLSys'24] Atom: Low-bit Quantization for Efficient and Accurate LLM Serving

CUDA · 336 stars · 31 forks · Updated Jul 2, 2024

mHC kernels implemented in CUDA

CUDA · 259 stars · 20 forks · Updated Mar 9, 2026

CPM.cu is a lightweight, high-performance CUDA implementation for LLMs, optimized for end-device inference and featuring cutting-edge techniques in sparse architecture, speculative sampling and qua…

CUDA · 236 stars · 22 forks · Updated Jan 14, 2026

Approximate nearest neighbor search with product quantization on GPU in pytorch and cuda

CUDA · 232 stars · 23 forks · Updated Dec 12, 2023
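Product quantization compresses each database vector into a few small codes by splitting it into subvectors and storing, per subspace, the index of the nearest codeword; query-time distances then reduce to table lookups. A minimal pure-Python sketch of encoding and asymmetric distance computation (function names are illustrative; codebook training, e.g. by k-means, is assumed done elsewhere):

```python
def pq_encode(x, codebooks):
    """Split x into len(codebooks) subvectors; for each, store the index
    of the nearest codeword in that subspace's codebook."""
    d_sub = len(x) // len(codebooks)
    code = []
    for m, cb in enumerate(codebooks):
        sub = x[m * d_sub:(m + 1) * d_sub]
        dists = [sum((a - b) ** 2 for a, b in zip(sub, cw)) for cw in cb]
        code.append(dists.index(min(dists)))
    return code

def pq_adc(q, code, codebooks):
    """Asymmetric distance: the query stays exact, the database point is
    its code. Build a per-subspace distance table once per query; the
    distance to any coded point is then just M table lookups."""
    d_sub = len(q) // len(codebooks)
    total = 0.0
    for m, cb in enumerate(codebooks):
        sub = q[m * d_sub:(m + 1) * d_sub]
        table = [sum((a - b) ** 2 for a, b in zip(sub, cw)) for cw in cb]
        total += table[code[m]]
    return total
```

The lookup-table structure is what makes PQ search GPU-friendly: scanning millions of codes becomes a batch of independent gathers and adds.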

From Minimal GEMM to Everything

CUDA · 195 stars · 10 forks · Updated Feb 10, 2026

CUDA implementation of Hierarchical Navigable Small World Graph algorithm

CUDA · 175 stars · 27 forks · Updated Apr 19, 2021

GGNN: State of the Art Graph-based GPU Nearest Neighbor Search

CUDA · 171 stars · 28 forks · Updated Feb 11, 2025

FP64 equivalent GEMM via Int8 Tensor Cores using the Ozaki scheme

CUDA · 117 stars · 7 forks · Updated Dec 2, 2025
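The Ozaki scheme works by splitting each high-precision operand into a sum of narrow slices, so that every slice-by-slice product is exact in low-precision hardware (here int8 × int8 with wide accumulation), and the exact partial products are then summed back with the right scaling. A toy integer version of that splitting idea in pure Python (the real scheme also handles FP64 exponents and per-row scaling, which this sketch omits):

```python
def to_slices(n, bits=7, num=8):
    """Split a non-negative integer into 'num' base-2**bits digits,
    each small enough to fit in an int8."""
    return [(n >> (bits * i)) & ((1 << bits) - 1) for i in range(num)]

def sliced_product(a, b, bits=7, num=8):
    """Multiply using only slice-by-slice products. Each partial product
    a_i * b_j is tiny and therefore exact -- mimicking the
    int8 x int8 -> int32 path of Int8 Tensor Core MMA -- and the shifts
    restore each partial product's place value."""
    A, B = to_slices(a, bits, num), to_slices(b, bits, num)
    total = 0
    for i, ai in enumerate(A):
        for j, bj in enumerate(B):
            total += (ai * bj) << (bits * (i + j))
    return total
```

As long as the operands fit in `bits * num` bits, the sliced result is bit-exact, which is why the scheme can reach FP64-equivalent accuracy from low-precision units.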

Benchmark code for the "Online normalizer calculation for softmax" paper

CUDA · 109 stars · 10 forks · Updated Jul 27, 2018
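The paper's observation is that the numerically safe softmax, which normally needs two passes (one for the max, one for the sum of exponentials), can be fused into a single pass by rescaling the running sum whenever a new maximum appears. A pure-Python sketch of that online recurrence:

```python
import math

def softmax_online(xs):
    """Single-pass safe softmax: maintain the running max m and the
    running normalizer d; when the max grows, rescale d by
    exp(m_old - m_new) so all accumulated terms stay consistent."""
    m = float("-inf")
    d = 0.0
    for x in xs:
        m_new = max(m, x)
        d = d * math.exp(m - m_new) + math.exp(x - m_new)
        m = m_new
    return [math.exp(x - m) / d for x in xs]
```

Fusing the two reduction passes halves the memory traffic of the normalizer computation, which is exactly what the benchmark in this repository measures.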

DeeperGEMM: crazy optimized version

CUDA · 86 stars · Updated May 5, 2025

🌈 Solutions of LeetGPU

CUDA · 82 stars · 11 forks · Updated Apr 10, 2026

High-Throughput, Cost-Effective Billion-Scale Vector Search with a Single GPU [to appear in SIGMOD'26]

CUDA · 21 stars · 5 forks · Updated Jan 16, 2026

A High-Throughput Multi-GPU System for Graph-Based Approximate Nearest Neighbor Search

CUDA · 21 stars · 1 fork · Updated Jul 22, 2025

A cross-modal vector index with fast construction on heterogeneous CPU-GPU environment. Published on DaMoN@SIGMOD 2025.

CUDA · 16 stars · Updated Jul 16, 2025