K-Wu

Kun Wu K-Wu

Making the Stack Data-Efficient, Composable & Scalable!⚓@NVIDIA Backend Compiler Engineer⚓PhD (@illinois-impact)⚓BEng (Tsinghua)

216 followers · 303 following

Achievements

x2 x2

Highlights

Developer Program Member
Pro

Lists (1)

SparseOps

2 repositories

Stars

24 stars written in Cuda

BBuf / how-to-optim-algorithm-in-cuda

How to optimize some algorithms in CUDA.

Cuda · 2,953 stars · 272 forks · Updated Apr 22, 2026

NVIDIA / nccl-tests

NCCL Tests

Cuda · 1,503 stars · 367 forks · Updated Apr 13, 2026

rapidsai / raft

RAFT contains fundamental widely-used algorithms and primitives for machine learning and information retrieval. The algorithms are CUDA-accelerated and form building blocks for more easily writing …

Cuda · 1,002 stars · 231 forks · Updated Apr 29, 2026

NVIDIA / nvbench

CUDA Kernel Benchmarking Library

Cuda · 859 stars · 105 forks · Updated Apr 22, 2026

cudpp / cudpp

CUDA Data Parallel Primitives Library

Cuda · 438 stars · 97 forks · Updated Nov 9, 2018

Cjkkkk / CUDA_gemm

A simple high performance CUDA GEMM implementation.

Cuda · 434 stars · 41 forks · Updated Jan 4, 2024

yzhaiustc / Optimizing-SGEMM-on-NVIDIA-Turing-GPUs

Optimizing SGEMM kernel functions on NVIDIA GPUs to a close-to-cuBLAS performance.

Cuda · 415 stars · 52 forks · Updated Jan 2, 2025

ZaidQureshi / bam

Cuda · 222 stars · 71 forks · Updated Mar 28, 2026

jundaf2 / INT8-Flash-Attention-FMHA-Quantization

Cuda · 162 stars · 17 forks · Updated Sep 15, 2023

wzsh / wmma_tensorcore_sample

Matrix Multiply-Accumulate with CUDA and WMMA (Tensor Core)

Cuda · 147 stars · 20 forks · Updated Aug 18, 2020

cwpearson / nvidia-performance-tools

Instructions, Docker images, and examples for Nsight Compute and Nsight Systems

Cuda · 136 stars · 22 forks · Updated May 19, 2020

dgSPARSE / dgSPARSE-Lib

PyTorch-Based Fast and Efficient Processing for Various Machine Learning Applications with Diverse Sparsity

Cuda · 121 stars · 28 forks · Updated Apr 27, 2026

yalue / cuda_scheduling_examiner_mirror

A tool for examining GPU scheduling behavior.

Cuda · 96 stars · 23 forks · Updated Aug 17, 2024

accel-sim / gpu-app-collection

A repository where GPU applications are aggregated using a common build flow that supports multiple CUDA versions.

Cuda · 93 stars · 61 forks · Updated Apr 14, 2026

EBD-CREST / nsparse

Sparse matrix computation library for GPU

Cuda · 59 stars · 14 forks · Updated Jul 12, 2020

xiezhq-hermann / graphiler

Graphiler is a compiler stack built on top of DGL and TorchScript which compiles GNNs defined using user-defined functions (UDFs) into efficient execution plans.

Cuda · 59 stars · 6 forks · Updated Oct 3, 2022

hornet-gt / hornetsnest

High-Performance Streaming Graph Analytics on GPUs

Cuda · 35 stars · 15 forks · Updated Jan 28, 2019

mattsinc / heterosync

HeteroSync is a benchmark suite for performing fine-grained synchronization on tightly coupled GPUs

Cuda · 32 stars · 6 forks · Updated Sep 19, 2024

smoorjani / matrix-multiplication

Custom SpMM operations integrated into PyTorch

Cuda · 11 stars · Updated Apr 15, 2022

developer-onizuka / gpudirect_storage

Cuda · 9 stars · 3 forks · Updated Oct 14, 2023

c5shen / GPU-BLAST-plus

A class project for CS508 from UIUC to implement the classic BLAST algorithm using different GPU computing techniques.

Cuda · 9 stars · Updated May 16, 2023

lenLRX / AmpereMicroBench

Cuda · 8 stars · Updated Jul 10, 2022

Amit104 / DetectingSCC

Detect Strongly Connected Components in a directed graph with low average degree using parallel programming.

Cuda · 4 stars · 2 forks · Updated Apr 30, 2018

lifrankfan / CUDA-General-Matrix-Multiplication

Matrix multiplication on CUDA

Cuda · 1 star · Updated May 21, 2023
