Starred repositories

22 stars written in Cuda

A massively parallel, optimal functional runtime in Rust

Cuda 11,235 437 Updated Nov 21, 2024

DeepEP: an efficient expert-parallel communication library

Cuda 9,564 1,203 Updated Apr 24, 2026

CUDA Library Samples

Cuda 2,380 457 Updated Apr 20, 2026

cuGraph - RAPIDS Graph Analytics Library

Cuda 2,166 350 Updated Apr 24, 2026

Flash Attention in ~100 lines of CUDA (forward pass only)

Cuda 1,125 111 Updated Dec 30, 2024

RAFT contains fundamental widely-used algorithms and primitives for machine learning and information retrieval. The algorithms are CUDA-accelerated and form building blocks for more easily writing …

Cuda 1,002 231 Updated Apr 24, 2026

Code from the "CUDA Crash Course" YouTube series by CoffeeBeforeArch

Cuda 947 182 Updated Jul 19, 2023

cuVS - a library for vector search and clustering on the GPU

Cuda 736 183 Updated Apr 27, 2026
Cuda 638 107 Updated Apr 23, 2026

Fast k nearest neighbor search using GPU

Cuda 546 111 Updated Aug 6, 2018

State of the art sorting and segmented sorting, including OneSweep. Implemented in CUDA, D3D12, and Unity style compute shaders. Theoretically portable to all wave/warp/subgroup sizes.

Cuda 454 30 Updated Dec 14, 2024

CUDA Data Parallel Primitives Library

Cuda 438 97 Updated Nov 9, 2018

A simple GPU hash table implemented in CUDA using lock free techniques

Cuda 406 44 Updated Feb 7, 2024

GPU implementation of a fast generalized ANS (asymmetric numeral system) entropy encoder and decoder, with extensions for lossless compression of numerical and other data types in HPC/ML applications.

Cuda 382 33 Updated Mar 18, 2026

HierarchicalKV is a part of NVIDIA Merlin and provides hierarchical key-value storage to meet RecSys requirements. The key capability of HierarchicalKV is to store key-value feature-embeddings on h…

Cuda 201 35 Updated Apr 12, 2026

Efficient Top-K implementation on the GPU

Cuda 192 25 Updated Apr 9, 2019

Custom PTX Instruction Benchmark

Cuda 139 11 Updated Feb 27, 2025

GPUfs - File system support for NVIDIA GPUs

Cuda 104 43 Updated Nov 26, 2018
Cuda 89 7 Updated Oct 22, 2024

RAPIDS Accelerator JNI For Apache Spark

Cuda 56 80 Updated Apr 25, 2026

Harmonia is an algorithm that allows for the implementation of operations on B+ trees using parallelization. As a part of my GPU project, I implemented the Harmonia paper published in 2019 in CUDA.

Cuda 31 Updated Aug 8, 2021

Comparison of regression line calculation in CUDA GPU code vs. AVX-512 code

Cuda 1 1 Updated Apr 15, 2022