Organizations

@NVIDIA @eesast @llvm @illinois-impact

24 stars written in Cuda

How to optimize some algorithms in CUDA.

Cuda 2,953 272 Updated Apr 22, 2026

NCCL Tests

Cuda 1,503 367 Updated Apr 13, 2026

RAFT contains fundamental widely-used algorithms and primitives for machine learning and information retrieval. The algorithms are CUDA-accelerated and form building blocks for more easily writing …

Cuda 1,002 231 Updated Apr 29, 2026

CUDA Kernel Benchmarking Library

Cuda 859 105 Updated Apr 22, 2026

CUDA Data Parallel Primitives Library

Cuda 438 97 Updated Nov 9, 2018

A simple high performance CUDA GEMM implementation.

Cuda 434 41 Updated Jan 4, 2024

Optimizing SGEMM kernel functions on NVIDIA GPUs to a close-to-cuBLAS performance.

Cuda 415 52 Updated Jan 2, 2025
Cuda 222 71 Updated Mar 28, 2026

Matrix Multiply-Accumulate with CUDA and WMMA (Tensor Core)

Cuda 147 20 Updated Aug 18, 2020

Instructions, Docker images, and examples for Nsight Compute and Nsight Systems

Cuda 136 22 Updated May 19, 2020

PyTorch-Based Fast and Efficient Processing for Various Machine Learning Applications with Diverse Sparsity

Cuda 121 28 Updated Apr 27, 2026

A tool for examining GPU scheduling behavior.

Cuda 96 23 Updated Aug 17, 2024

A repository where GPU applications are aggregated using a common build flow that supports multiple CUDA versions.

Cuda 93 61 Updated Apr 14, 2026

Sparse matrix computation library for GPU

Cuda 59 14 Updated Jul 12, 2020

Graphiler is a compiler stack built on top of DGL and TorchScript which compiles GNNs defined using user-defined functions (UDFs) into efficient execution plans.

Cuda 59 6 Updated Oct 3, 2022

High-Performance Streaming Graph Analytics on GPUs

Cuda 35 15 Updated Jan 28, 2019

HeteroSync is a benchmark suite for performing fine-grained synchronization on tightly coupled GPUs

Cuda 32 6 Updated Sep 19, 2024

Custom SpMM operations integrated into PyTorch

Cuda 11 Updated Apr 15, 2022

A class project for CS508 from UIUC to implement the classic BLAST algorithm using different GPU computing techniques.

Cuda 9 Updated May 16, 2023
Cuda 8 Updated Jul 10, 2022

Detect Strongly Connected Components in a directed graph with low average degree using parallel programming.

Cuda 4 2 Updated Apr 30, 2018

Matrix multiplication on CUDA

Cuda 1 Updated May 21, 2023