Several optimization methods for half-precision general matrix multiplication (HGEMM) using tensor cores with the WMMA API and MMA PTX instructions.
Updated Sep 8, 2024 - CUDA
Code for benchmarking GPU performance based on cublasSgemm and cublasHgemm.
CUDA-L2: Surpassing cuBLAS Performance for Matrix Multiplication through Reinforcement Learning
Algorithms implemented in CUDA + resources about GPGPU
Companion code for the bilibili video series "Introduction to CUDA 12.x Parallel Programming (C++ Edition)".
CUDA kernel functions
Uses tensor cores to compute back-to-back HGEMM (half-precision general matrix multiplication) with MMA PTX instructions.
Harness the power of GPU acceleration for fusing visual odometry and IMU data with an advanced Unscented Kalman Filter (UKF) implementation. Developed in C++ and utilizing CUDA, cuBLAS, and cuSOLVER, this system offers unparalleled real-time performance in state and covariance estimation for robotics and autonomous system applications.
An MNIST handwritten-digit classifier written from scratch in CUDA C.
Lab exercise on CUDA programming from the Parallel Processing course at NTUA.
Matrix Exponential Approximation using CUDA
A cuBLAS/CUDA-based implementation of multi-GPU large matrix multiplication.
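Several of the repositories above center on tensor-core HGEMM via the WMMA API. As a minimal illustration of that technique (a sketch only, not taken from any of the listed repos), the kernel below has one warp compute a single 16x16 output tile of C = A * B in half precision with float accumulation; it assumes row-major A, column-major B, dimensions that are multiples of 16, and a GPU of compute capability 7.0 or newer (compile with, e.g., `nvcc -arch=sm_70`):

```cuda
#include <cuda_fp16.h>
#include <mma.h>

using namespace nvcuda;

// Each warp computes one 16x16 tile of C. Fragments are warp-level
// register tiles; load/mma/store calls are collective across the warp.
__global__ void wmma_hgemm_tile(const half *A, const half *B, float *C,
                                int M, int N, int K) {
    // Tile coordinates: one warp per tile along M, one thread-block row per tile along N.
    int tileM = (blockIdx.x * blockDim.x + threadIdx.x) / warpSize;
    int tileN = blockIdx.y * blockDim.y + threadIdx.y;
    if (tileM * 16 >= M || tileN * 16 >= N) return;

    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;
    wmma::fill_fragment(c_frag, 0.0f);

    // Walk the K dimension 16 columns at a time, accumulating in float.
    for (int k = 0; k < K; k += 16) {
        wmma::load_matrix_sync(a_frag, A + tileM * 16 * K + k, K);
        wmma::load_matrix_sync(b_frag, B + tileN * 16 * K + k, K);
        wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
    }

    wmma::store_matrix_sync(C + tileM * 16 * N + tileN * 16, c_frag, N,
                            wmma::mem_row_major);
}
```

Real HGEMM kernels in the repositories above go much further (shared-memory staging, double buffering, swizzled layouts, and raw MMA PTX instead of WMMA), but the fragment/load/mma/store pattern shown here is the common starting point.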