CUDA matrix multiplication, reduction, and softmax kernels optimized for my RTX 4070 in C++17
Updated Oct 17, 2025 · Cuda
A C++ header-only library for parallel linear algebra on GPUs (CUDA/cuBLAS under the hood)