- Santa Clara
- kunwu.me
- https://orcid.org/0000-0002-0149-1409
- in/kun-wu-069a14105
- https://go.kunwu.me/wakatime
Stars
How to optimize some algorithms in CUDA.
RAFT contains fundamental widely-used algorithms and primitives for machine learning and information retrieval. The algorithms are CUDA-accelerated and form building blocks for more easily writing …
A simple high performance CUDA GEMM implementation.
Optimizing SGEMM kernel functions on NVIDIA GPUs to a close-to-cuBLAS performance.
Matrix Multiply-Accumulate with CUDA and WMMA (Tensor Core)
Instructions, Docker images, and examples for Nsight Compute and Nsight Systems
PyTorch-Based Fast and Efficient Processing for Various Machine Learning Applications with Diverse Sparsity
A tool for examining GPU scheduling behavior.
A repository where GPU applications are aggregated using a common build flow that supports multiple CUDA versions.
Graphiler is a compiler stack built on top of DGL and TorchScript which compiles GNNs defined using user-defined functions (UDFs) into efficient execution plans.
High-Performance Streaming Graph Analytics on GPUs
HeteroSync is a benchmark suite for performing fine-grained synchronization on tightly coupled GPUs.
Custom SpMM operations integrated into PyTorch
A class project for CS508 from UIUC to implement the classic BLAST algorithm using different GPU computing techniques.
Detect Strongly Connected Components in a directed graph with low average degree using parallel programming.
Matrix multiplication on CUDA